Fig. 13. Exact numerical results for the repository size required to detect association are shown
as a function of the allele frequency p for (A) dominant inheritance, (B) additive inheritance,
and (C) recessive inheritance for tests using pooled DNA. The variance ratio $\sigma_A^2/\sigma_R^2$ is 0.02,
the type I error is $5\times10^{-8}$, the type II error is 0.2, and the pooling fraction 0.27 is used for all
designs except Mahalanobis, for which 0.188 is used. The Mahalanobis design loses power for
rare alleles faster than the other designs.

Fig. 14. Exact numerical results for the repository size required to detect association are shown
as a function of the heterozygote phenotypic displacement d describing the inheritance mode,
for allele frequencies of (A) p = 0.5, (B) p = 0.25, and (C) p = 0.1 for tests using pooled
DNA. All other parameters are as in Fig. 13.

Fig. 15. The repository size required to detect association for a QTL for a complex trait is
shown for pooled DNA designs relative to individual genotyping designs having equivalent
type I and type II error rates. The ratio $N_{\mathrm{aff/unaff}}/N_{\mathrm{indiv}}$ for affected/unaffected pools (dashed
line) is shown as a function of the disease prevalence r, while the ratio $N_{\mathrm{tails}}/N_{\mathrm{indiv}}$ (solid line) is
shown as a function of the fraction $\rho$ of the total population selected for each pool. The
optimum value of $N_{\mathrm{tails}}/N_{\mathrm{indiv}}$ is 1.24 and occurs at $\rho$ = 27.03% selected for each pool.


Fig. 16. The effect of varying the inheritance mode is shown for tail pools. The type I error is
$5\times10^{-8}$, the type II error rate is 0.2, and the displacement a is 0.25 in units of the phenotypic
standard deviation. The displacement d of heterozygotes varies from -a, pure recessive
inheritance, to +a, pure dominant inheritance. Three allele frequencies are shown, p = 0.5, 0.1,
and 0.01. Solid lines correspond to exact numerical calculations. (Top) The repository size N
is shown. Filled circles corresponding to analytical approximations, Eq. 1, are virtually
indistinguishable from exact calculations. (Bottom) The optimal pooling fraction $\rho$ from
numerical calculations falls in a narrow range from 24.5% to 27.5%, close to the analytical
approximation of 27.03%.
Fig. 17. (Top) Exact numerical results for the repository size N required to achieve a type I
error rate of $5\times10^{-8}$ and type II error rate of 0.2 are shown for affected/unaffected pools
(dashed line) and tail pools (solid line) as a function of the additive variance, also presented as
the genotype relative risk for a heterozygote, for an allele with frequency 0.1 and purely
additive inheritance. Analytical approximations (solid circles), Eqs. 1 and 2, are
indistinguishable from the exact results when the genotype relative risk is smaller than 2. The
disease prevalence r is 10% for the affected/unaffected pools, and 27% of the population is
selected for each of the tail pools. (Bottom) The frequency difference at the significance
threshold is shown for the same parameters. This threshold determines the measurement
accuracy required for association tests based on pooled DNA.
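
The optimum pooling fraction of 27.03% and the ratio of 1.24 quoted in the captions for Figs. 15 and 16 can be illustrated with a short numerical sketch. It assumes a small-effect QTL and a standard normal phenotype, for which the repository size for tail pools relative to individual genotyping is taken to be $\rho/(2\phi(z)^2)$, where z is the threshold cutting off an upper tail of area $\rho$ and $\phi$ is the standard normal density. This expression and the function names are illustrative assumptions, not the exact calculation behind the figures.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, variance 1


def tail_pool_ratio(rho: float) -> float:
    """Assumed ratio N_tails / N_indiv = rho / (2 * phi(z)^2) for a small-effect QTL.

    z is the phenotype threshold that cuts off an upper tail of area rho.
    """
    z = STD_NORMAL.inv_cdf(1.0 - rho)
    phi = STD_NORMAL.pdf(z)
    return rho / (2.0 * phi * phi)


def optimal_pooling_fraction(step: float = 1e-4) -> tuple[float, float]:
    """Grid search over 0 < rho < 0.5 for the fraction minimizing the ratio."""
    candidates = [i * step for i in range(1, int(0.5 / step))]
    best = min(candidates, key=tail_pool_ratio)
    return best, tail_pool_ratio(best)


rho_opt, ratio_opt = optimal_pooling_fraction()
print(f"optimal fraction = {rho_opt:.4f}, N_tails/N_indiv = {ratio_opt:.3f}")
```

Under this assumption the minimum is shallow and lies near 27% of the population in each pool, with a ratio close to 1.24, consistent with the values stated in the captions.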



Detailed Description of the Invention 

1. Definitions

Glossary of mathematical symbols 

$X$  quantitative phenotypic value of an individual
$X_i$  quantitative phenotypic value of sib i, with i = 1 or 2 for sib-pairs
$X_\pm$  $(X_1 \pm X_2)/2$
$r$  phenotypic correlation between sibs
$A_i$  allele inherited at a particular locus. For a bi-allelic marker, i = 1 or 2
$G$  genotype at the locus, either $A_1A_1$, $A_1A_2$, or $A_2A_2$ for a bi-allelic marker
$G_i$  genotype for sib i, with i = 1 or 2 for sib-pairs
$P(G)$  genotype probability
$P(G_1,G_2)$  joint sib-pair genotype probability
$f(X_1,X_2)$  joint sib-pair phenotype probability distribution
$f[X_1,X_2|G_1,G_2]$  joint sib-pair phenotype probability distribution conditioned on
    genotypes
$p$  frequency of allele $A_1$ in a population
$q$  frequency of the remaining alleles, with q = 1 - p
$p_i$  frequency of allele $A_1$ in sib i, either 1, 0.5, or 0 for an autosomal marker
$p_\pm$  $(p_1 \pm p_2)/2$

$a$  half the difference in the mean phenotypic value between individuals with
    genotype $A_1A_1$ and individuals with genotype $A_2A_2$
$d$  difference in the mean phenotypic value between individuals with genotype
    $A_1A_2$ compared to the mid-point of the means for $A_1A_1$ and $A_2A_2$
$\mu$  mean phenotypic shift due to the locus, equal to $a(p-q) + 2pqd$
$\sigma_A^2$  additive variance of phenotype X due to the genotype G
$\sigma_D^2$  dominance variance due to the genotype G
$\sigma_R^2$  residual phenotypic variance, with $\sigma_A^2 + \sigma_D^2 + \sigma_R^2 = 1$
$N$  the total number of individuals whose DNA is available for pooling
$n$  number of individuals selected for a single pool
$\rho$  pooling fraction defined as n/N
$p_U$, $p_L$  frequency of allele $A_1$ in the upper (U) or lower (L) pool
$T$  test statistic, which is expected to be close to zero when the genotype G does
    not affect the phenotypic value and is expected to be non-zero when individuals with
    genotypes $A_1A_1$, $A_1A_2$, and $A_2A_2$ have different mean phenotypic values. As formulated
    here, T has a normal distribution with unit variance. Under the null hypothesis that
    $\sigma_A = (2pq)^{1/2}[a - (p-q)d]$ is zero, the mean of T is zero. Under the alternative hypothesis
    that $\sigma_A$ is non-zero, the mean of T is also non-zero.
$\sigma_0^2$  variance of $n^{1/2}(p_U - p_L)$ under the null hypothesis
$\sigma_1^2$  variance of $n^{1/2}(p_U - p_L)$ under the alternative hypothesis
$\Phi(z)$  cumulative standard normal probability, the area under a standard normal
    distribution up to normal deviate z
$z_\alpha$  normal deviate corresponding to an upper tail area of $\alpha$, defined as $\Phi(z_\alpha) = 1 - \alpha$
$\alpha$  type I error rate (false-positive rate). For a one-sided test, $T > z_\alpha$ corresponds to
    statistical significance at level $\alpha$, typically termed a p-value. A typical threshold for
    significance is a p-value smaller than 0.05 or 0.01. If M independent tests are
    conducted, a conservative correction that yields a final p-value of $\alpha$ is to use a p-value
    of $\alpha/M$ for each of the M tests.
$\beta$  type II error rate (false-negative rate). The power of a test is $1 - \beta$.
$H(x)$  Heaviside step function
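
The conservative multiple-testing correction and the normal deviate from the glossary can be sketched as follows. The number of tests (one million markers) is a hypothetical example chosen so that a final level of 0.05 gives the per-test level of $5\times10^{-8}$ used in the figures; the function names are illustrative.

```python
from statistics import NormalDist


def per_test_alpha(final_alpha: float, num_tests: int) -> float:
    """Conservative per-test threshold alpha / M for M independent tests."""
    return final_alpha / num_tests


def normal_deviate(alpha: float) -> float:
    """z_alpha with upper-tail area alpha, i.e. Phi(z_alpha) = 1 - alpha."""
    # By symmetry z_alpha = -Phi^{-1}(alpha); this avoids rounding in 1 - alpha.
    return -NormalDist().inv_cdf(alpha)


alpha = per_test_alpha(0.05, 1_000_000)  # hypothetical scan of 10^6 markers
z_alpha = normal_deviate(alpha)          # one-sided significance threshold for T
print(alpha, z_alpha)
```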

As used herein, when two individuals are "related to each other", they are genetically related
in a direct parent-child relationship or a sibling relationship. In a sibling relationship, the two
individuals of the sibling pair have the same biological father and the same biological mother.

As used herein, the term "sib" is used to designate the word "sibling", and the sibling
relationship is defined above. The term "sib pair" is used to designate a set of two siblings.

The members of a sib pair may be dizygotic, indicating that they originate from different
fertilized ova. A sib pair includes dizygotic twins.

The focus of the present invention is to examine the statistical power of pooling designs for
quantitative phenotypes. A variance components model provides the distribution of
phenotypic values for an unselected population of unrelated individuals or sib pairs. The
phenotype is partitioned into contributions from a specific causative allele and from residual
shared and non-shared familial and genetic factors. The genotype-dependent phenotype
distribution for sib pairs under Hardy-Weinberg equilibrium is used as the basis for analyzing
the statistical power of various pooling strategies. The test statistic in each case is the allele
frequency difference between two pools, appropriately standardized to a normal distribution.
Numerically exact results are provided for a range of parameters including the fraction of the
population pooled, the allele frequency, and the dominant or recessive character of the allele.
Furthermore, upon consideration of the relative powers of pooling designs, pooling designs are
suggested for particular phenotype characteristics.

2. Model 1 

2.1 Biometrical Genetic Model

A quantitative phenotype X, standardized to zero mean and unit variance, is hypothesized to be
affected by the genotype G at a biallelic locus with alleles $A_1$ and $A_2$, occurring at population
frequencies $p_1$ and $p_2 = 1 - p_1$. More generally, $A_2$ may represent any of a number of alternate
alleles, and $p_2$ their aggregate frequency. The population is assumed to be random mating,
with genotype frequencies P(G) of $p_1^2$, $2p_1p_2$, and $p_2^2$ for $A_1A_1$, $A_1A_2$, and $A_2A_2$
respectively. The frequency of allele $A_1$ in genotype G, denoted $p_G$, is 1 for $A_1A_1$, 0.5 for $A_1A_2$,
much-reduced cost. Here we analyze pooling methods to establish association between a genetic polymorphism and a quantitative
phenotype. Exact results are provided for the statistical power for a number of pooling designs where the phenotype is described by
a variance components model and the fraction of the population pooled is optimized to minimize the population requirements. For
low to moderate sibling phenotypic correlation, unrelated populations are more powerful than sib pair populations with an equal
number of individuals; for sibling phenotypic correlations above 75%, however, designs selecting the sib pairs with the greatest
phenotype difference become more powerful. For sibling phenotype correlations below 75%, pooling extreme unrelated individuals
is the most powerful design for sib pair populations. The optimal pooling fractions for each design are constant over a wide range
of parameters. These results for quantitative phenotypes differ from those reported for qualitative phenotypes, for which unrelated
populations are more powerful than sib pairs and concordant designs are more powerful than discordant, and have immediate
relevance to ongoing association studies and anticipated whole-genome scans.
DNA Pooling Methods For Quantitative Traits Using Unrelated Populations
Or Sib Pairs

Background of the Invention 

The complex diseases that present the greatest challenge to modern medicine,
including cancer, cardiovascular disease, and metabolic disorders, arise through the interplay
of numerous genetic and environmental factors. One of the primary goals of the human
genome project is to assist in the risk-assessment, prevention, detection, and treatment of these
complex disorders by identifying the genetic components. Disentangling the genetic and
environmental factors requires carefully designed studies. One approach is to study highly
homogeneous populations (Nillson and Rose 1999; Rabinow 1999; Frank 2000). A recognized
drawback of this approach, however, is that disease-associated markers or causative alleles
found in an isolated population might not be relevant for a larger population. An attractive
alternative is to use well-matched case-control studies of a more diverse population. A second
alternative is to study siblings, inherently matched for environmental effects.

Even with a well-matched sample set, the genetic factors contributing to an aberrant
phenotype may be difficult to determine. Traditional linkage analysis methods identify
physical regions of DNA whose inheritance pattern correlates with the inheritance of a
particular trait (Liu 1997; Sham 1997; Ott 1999). These regions may contain millions of
nucleotides and tens to hundreds of genes, and identifying the causative mutation or a tightly
linked marker is still a challenge. A more recent approach is to use a sufficiently dense
marker set to identify causative changes directly. Single nucleotide polymorphisms, or SNPs,
can provide such a marker set (Cargill et al. 1999). These are typically bi-allelic markers with
linkage disequilibrium extending an estimated 10,000 to 100,000 nucleotides in heterogeneous
human populations (Kruglyak 1999; Collins et al. 2000). Tens to hundreds of thousands of
these closely spaced markers are required for a complete scan of the 3 billion nucleotides in
the human genome. Because each SNP constitutes a separate test, the significance threshold
must be adjusted for multiple hypotheses (p-value ~ $10^{-8}$) to identify statistically meaningful
associations. Consequently, hundreds to thousands of individuals are required for association
studies (Risch and Merikangas 1996).

The most powerful tests of association require that each individual be genotyped for
every marker (Fulker et al. 1995; Kruglyak and Lander 1995; Abecasis et al. 2000; Cardon
2000) and remain far too costly for all but testing candidate genes. An alternative that
circumvents the need for individual genotypes, related to previous DNA pooling methods for
determination of linkage between a molecular marker and a quantitative trait locus (Darvasi
and Soller 1994), is to determine allele frequencies for sub-populations pooled on the basis of
a qualitative phenotype. Populations of unrelated individuals, separated into affected and
unaffected pools, have greater power than related populations. If a population consists of
sib-pairs, concordant pairs versus unrelated controls have greater power than discordant pairs
separated into affected and unaffected pools (Risch and Teng 1998). Nevertheless, discordant
designs might provide a better control for confounding factors such as age, ethnicity, or
environmental effects.

The phenotypes relevant for complex disease are often quantitative, however, and
converting a quantitative score to a qualitative classification represents a loss of information
that can reduce the power of an association study. The location of the dividing line for
affected versus unaffected classification, for example, can affect the power to detect
association. Furthermore, pooling designs based on a comparison of numerical scores are not
even possible with a qualitative classification scheme. These distinctions can be especially
relevant when populations contain related individuals and qualitative tests have a disadvantage
(Risch and Teng 1998).

There remains a need for procedures that provide phenotype associations with diseases
or pathologies based on phenotypes that may be ranked on a quantitative scale. In such a
scheme there is a strong need to identify procedures for optimally obtaining samples, or
pooling, from a subpopulation that provide the highest assurance of displaying associations
that are present. In addition there is a need to distinguish among various pooling strategies
that may arise in cases with different allele frequencies and different allele correlations. There
is a further need to devise a test criterion for establishing the significance of associations
between phenotypes and diseases or pathologies that may arise. The present invention
addresses these and related deficiencies that currently exist.

Summary of the Invention 

The present invention is based, in part, on the discovery of methods to detect an
association in a population of individuals between a genetic locus and a quantitative
phenotype, where two or more alleles occur at a given genetic locus, and the phenotype is
expressed using a numerical phenotypic value whose range falls within a first numerical limit
and a second numerical limit. These limits are used to provide for subpopulations that consist
of upper and lower pools.

In some embodiments, the population of individuals includes individuals who may be 
classified into classes. In certain aspects of the invention, these classes are based on age, 
gender, race, or ethnic origin. In other aspects, some or all members of a class are included in 
the pools. 

In various embodiments, these numerical limits are chosen so that the upper pool
includes the highest 10%, 15%, 20%, 25%, 27%, 30%, or 35% of the population. In other
embodiments, the numerical limits are chosen such that the lower pool includes the lowest
10%, 15%, 20%, 25%, 27%, 30%, or 35% of the population.

In one embodiment of the invention, the numerical limits are chosen to minimize
false-negative errors.

In the present invention, the population of individuals can include unrelated individuals
or related individuals. In one aspect, these related individuals are sibling pairs (sib pairs). In a
further aspect, each member of the sib pair is selected for the upper pool. In a still further
aspect, each member of the sib pair is selected for the lower pool. In still yet another aspect,
neither member of the sib pair is selected. In another aspect, one member of the sib pair is
selected for the upper pool and the other member of the sib pair is selected for the lower pool.

In one embodiment of the invention, sib pairs are ranked by the absolute magnitude of
the difference in phenotypic value for the siblings within each pair. In one aspect, the percent
of pairs with the greatest difference are identified, and the siblings in each pair are distributed
such that the sibling with the high phenotypic value is selected for the upper pool and the
sibling with the low phenotypic value is selected for the lower pool. In an aspect of this
embodiment, the phenotypic value of one member of the sibling pair is above a predetermined
lower limit and the phenotypic value of the second member of the sibling pair is below a
predetermined upper limit. In various other aspects, the percentage of pairs with the greatest
difference is 80%, 70%, 60%, 54% or 50%, and the distribution provides 10%, 15%, 20%,
25%, or 27% of the population in each pool.
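
The ranking of sib pairs by within-pair phenotype difference can be sketched in a few lines. The data layout (a list of two-element tuples) and the function name are assumptions made for illustration only; ties and the optional upper and lower limits mentioned above are not handled.

```python
def pair_difference_pools(pairs, fraction):
    """Split the most discordant sib pairs between an upper and a lower pool.

    pairs    -- list of (x1, x2) phenotypic values, one tuple per sib pair
    fraction -- fraction of pairs to keep (e.g. 0.54), ranked by |x1 - x2|
    Returns (upper, lower): lists of (pair_index, sib_index) selections.
    """
    n_keep = int(round(fraction * len(pairs)))
    # Rank pair indices from the largest to the smallest absolute difference.
    ranked = sorted(range(len(pairs)),
                    key=lambda i: abs(pairs[i][0] - pairs[i][1]),
                    reverse=True)
    upper, lower = [], []
    for i in ranked[:n_keep]:
        x1, x2 = pairs[i]
        high, low = (0, 1) if x1 >= x2 else (1, 0)
        upper.append((i, high))  # sibling with the higher phenotypic value
        lower.append((i, low))   # sibling with the lower phenotypic value
    return upper, lower


example = [(1.2, -0.8), (0.1, 0.0), (-1.5, 0.9), (0.4, 0.6)]
upper, lower = pair_difference_pools(example, 0.5)
```

Because each selected pair contributes one sibling to each pool, keeping 54% of the pairs places 27% of all individuals in each pool.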

In an embodiment of the invention, Mahalanobis ranks are generated among sib pairs.
In one aspect, these ranks are used to construct pools composed of the member of the sib pair
with the more extreme Mahalanobis rank. In another aspect, the Mahalanobis ranks are used to
generate a list in which the order of each member of a sib pair in this list is determined by the
smaller of the distance of a member from the first member on the list and the distance of a
member from the last member on the list.

Unless otherwise defined, all technical and scientific terms used herein have the same
meaning as commonly understood by one of ordinary skill in the art to which this invention
belongs. Although methods and materials similar or equivalent to those described herein can
be used in the practice or testing of the present invention, suitable methods and materials are
described below. All publications, patent applications, patents, and other references
mentioned herein are incorporated by reference in their entirety. In the case of conflict, the
present specification, including definitions, will control. In addition, the materials, methods,
and examples are illustrative only and not intended to be limiting.

Other features and advantages of the invention will be apparent from the following
detailed description and claims.

Brief Description of the Figures 

Fig. 1. Shaded regions illustrate which siblings are selected under different pooling designs.
The x-axis represents $X_1$, the phenotypic value for the first sibling, and the y-axis represents
$X_2$, the value for the second sibling. The indicator functions $I_{U1}$, $I_{U2}$, $I_{L1}$, and $I_{L2}$ take the value
1 when a sibling is selected for the denoted pool and are 0 otherwise. The unrelated-random
design assumes a population of unrelated individuals, and only the first sibling is used. The
pair-mean design depends on the sibling phenotype mean $X_+ = (X_1 + X_2)/2$; the pair-difference
design depends on the difference $X_- = (X_1 - X_2)/2$.
Fig. 2. The population N necessary to detect association is shown as a function of the pooling
fraction $\rho$ for three values of the sibling phenotype correlation r. Panel A: r = 0.1, low
correlation; Panel B: r = 0.5, moderate correlation; Panel C: r = 0.9, high correlation. The
values of the remaining parameters are $\alpha = 5\times10^{-8}$, $1-\beta = 0.8$, $p_1 = 0.1$, $\sigma_A^2 = 0.02$, and d/a = 0.
For low to moderate sibling correlation, the unrelated-random design is more powerful than
any design using sib pairs; for high sibling correlation, sib-apart designs are more powerful.
The flat minima indicate that pooling fractions close to the minima are near optimal.

Fig. 3. The population N necessary to detect association is shown as a function of the sibling
phenotype correlation r. The pooling fraction $\rho$ is optimized to minimize the population
requirements at specified false-positive rate $\alpha = 5\times10^{-8}$ and power $1-\beta = 0.8$ with remaining
parameters $p_1 = 0.1$, $\sigma_A^2 = 0.02$, and d/a = 0. Panel A: Below r = 0.75, the unrelated-random
design is most powerful, followed by unrelated-extreme for sib pairs; above r = 0.75, the
pair-difference design is most powerful. The sib-apart designs are more powerful than sib-together
designs above r = 0.5 but are less powerful below this value. Panel B: The optimal pooling
fraction is approximately 0.27 for the unrelated-random, pair-mean, pair-difference, and
concordant designs; 0.18 for the unrelated-extreme design; and 0.23 for the discordant design.
The optimal pooling fraction decreases for sib-apart designs in regions of large sibling
correlation.

Fig. 4. The population N necessary to detect association is shown as a function of the
minor-allele frequency $p_1$. The pooling fraction $\rho$ is optimized to minimize the population
requirements at specified false-positive rate $\alpha = 5\times10^{-8}$ and power $1-\beta = 0.8$ with remaining
parameters r = 0.4, $\sigma_A^2 = 0.02$, and d/a = 0. Panel A: The population N is relatively flat until $p_1$
falls below the additive variance $\sigma_A^2$, at which point the phenotype becomes nearly monogenic
and the population requirement decreases. Panel B: The optimal pooling fraction $\rho$ is relatively
flat until $p_1$ falls below the additive variance $\sigma_A^2$, at which point it decreases rapidly.

Fig. 5. The population N necessary to detect association is shown as a function of the additive
variance $\sigma_A^2$. The pooling fraction $\rho$ is optimized to minimize the population requirements at
specified false-positive rate $\alpha = 5\times10^{-8}$ and power $1-\beta = 0.8$ with remaining parameters
r = 0.4, $p_1 = 0.1$, and d/a = 0. Panel A: The population requirement is proportional
to $1/\sigma_A^2$, except for very large values of $\sigma_A^2$ characteristic of a monogenic trait. Panel B: The
optimal pooling fraction $\rho$ is independent of $\sigma_A^2$ except for large values of $\sigma_A^2$.

Fig. 6. The population N necessary to detect association is shown for four values of the
dominance ratio d/a as a function of the pooling fraction $\rho$. The remaining parameters are
$\alpha = 5\times10^{-8}$, $1-\beta = 0.8$, r = 0.4, $p_1 = 0.1$, and $\sigma_A^2 = 0.02$. Panel A: d/a = -1 (pure recessive);
Panel B: d/a = -0.9; Panel C: d/a = -0.5; Panel D: d/a = 1 (pure dominant). These values were
selected to sample the ratio of dominance variance to total variance for the allele,
$\sigma_D^2/(\sigma_D^2 + \sigma_A^2)$. Most association methods are more sensitive to additive variance than
dominance variance. Close to $d/a = 1/(2p_1 - 1)$, the additive variance vanishes and the curve
of N versus $\rho$ changes from having a shallow minimum near $\rho$ = 0.27 ($\rho$ = 0.18 for
unrelated-extreme) to being steeply sloped toward $\rho$ = 0. For rare alleles, this behavior occurs in a
narrow region near d/a = -1 (pure recessive).

Fig. 7. The population N necessary to detect association is shown as a function of the
dominance ratio d/a. Panel A: N when the pooling fraction $\rho$ = 0.2; Panel B: N when $\rho$ has
been optimized to minimize the population requirements for each value of d/a; Panel C: the
optimized $\rho$. The remaining parameters are $\alpha = 5\times10^{-8}$, $1-\beta = 0.8$, r = 0.4, $p_1 = 0.1$, and
$\sigma_A^2 = 0.02$. When $\rho$ = 0.2, near-optimal for alleles with additive variance, the population
requirements increase markedly near d/a = -1 where the additive variance is small relative to
the dominance variance for a low-frequency allele. The population requirements to detect rare
recessive alleles could be reduced by decreasing $\rho$ by 10-fold to 100-fold, but this would
reduce the power to detect association for alleles outside of this narrow region of large
dominance variance. The population requirements and the optimal pooling fraction are not
sensitive to changes in d/a for low-frequency alleles that are under-dominant (d/a < -2),
weakly recessive (d/a ≈ -0.5), additive (d/a = 0), dominant (d/a = 1), or over-dominant
(d/a > 1).

Fig. 8. The population N required to detect association is shown as a function of the Type I
error rate $\alpha$ and the Type II error rate $\beta$. The pooling fraction $\rho$ has been optimized to
minimize the population size. Panel A: N is asymptotic to $2\ln(1/\alpha)$ for small values of $\alpha$. The
remaining parameters are $1-\beta = 0.8$, r = 0.4, $p_1 = 0.1$, $\sigma_A^2 = 0.02$, and d/a = 0. Panel B: The
optimal pooling fraction $\rho$ is not sensitive to changes in $\alpha$. Panel C: The required population
increases when $\beta$ decreases. The remaining parameters are $\alpha = 5\times10^{-5}$, appropriate for a test
of 1000 candidate polymorphisms versus a single phenotype, r = 0.4, $p_1 = 0.1$, $\sigma_A^2 = 0.02$, and
d/a = 0.

Fig. 9. The repository size required to detect association using pooled DNA is shown as a
function of the fraction of population $\rho$ selected for each pool, relative to the repository size
required for a regression test using individual genotyping, for a QTL making a small
contribution to a complex trait. The same family structure and the same phenotypic variable,
either the individual phenotype, the pair-mean, the pair-difference, or the combined results
of pair-mean and pair-difference tests, are used for tests based on pooling and individual
genotyping. All of these tests show the same relative efficiency as a function of pooling
fraction, with an optimal fraction of 0.27 requiring only 1.24x the population for individual
genotyping. The Mahalanobis design is compared to the combined regression test for a sibling
phenotypic correlation of $t_R$ = 0.6. The optimum occurs for this, and all other values of $t_R$, at
$\rho$ = 0.188.

Fig. 10. The repository size required to detect association for the Mahalanobis design, relative
to the population required for a combined regression test using individual genotypes, is shown
as a function of the sibling phenotypic correlation $t_R$.

Fig. 11. The number of individuals required for pooling designs with a sib-pair family
structure is compared to the number of unrelated individuals for an association test of
equivalent power and significance as a function of the sibling phenotypic correlation $t_R$.

Fig. 12. (A) Exact numerical results for the repository size required to detect association are
shown for pooling designs as a function of $\sigma_A^2/\sigma_R^2$, the ratio of the additive variance of the
QTL to the residual variance. The remaining parameters are allele frequency 0.1, additive
inheritance, type I error $5\times10^{-8}$, and type II error 0.2. (B) The allele frequency difference at
significance is shown for the same parameters as in Fig. 12A. In this and all subsequent figures,
unrelated-population is a dotted line, Mahalanobis a thin line, pair-mean a dashed line,
pair-difference a dot-dashed line, and sib-combined a thick line.
and 0 for $A_2A_2$. The bivariate probability distribution $P(G_1,G_2)$ of the 9 possible combinations
of dizygotic sib-pair genotypes $G_1$ and $G_2$, shown in Table I, can be derived by considering all
possible parental mating types and their offspring genotype distributions (Neale and Cardon
1992). The shared genetic makeup implies that $P(G_1,G_2) \neq P(G_1)P(G_2)$.
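
The enumeration over parental mating types can be sketched as follows. This is a minimal illustration assuming random mating, Hardy-Weinberg genotype frequencies in the parents, and Mendelian transmission; genotypes are encoded by their number of $A_1$ copies, and the function names are hypothetical rather than taken from Table I.

```python
from itertools import product


def hardy_weinberg(p1):
    """P(G) indexed by the number of A1 copies (2 = A1A1, 1 = A1A2, 0 = A2A2)."""
    p2 = 1.0 - p1
    return {2: p1 * p1, 1: 2.0 * p1 * p2, 0: p2 * p2}


def offspring_distribution(father, mother):
    """P(child genotype | parental genotypes), genotypes as A1 copy counts."""
    tf, tm = father / 2.0, mother / 2.0  # chance each parent transmits A1
    return {2: tf * tm,
            1: tf * (1.0 - tm) + (1.0 - tf) * tm,
            0: (1.0 - tf) * (1.0 - tm)}


def sib_pair_joint(p1):
    """P(G1, G2) for full sibs: sum over the 9 ordered parental mating types."""
    hw = hardy_weinberg(p1)
    joint = {(g1, g2): 0.0 for g1, g2 in product((0, 1, 2), repeat=2)}
    for father, mother in product((0, 1, 2), repeat=2):
        mating = hw[father] * hw[mother]  # random mating
        child = offspring_distribution(father, mother)
        for g1, g2 in joint:
            # Sibs are independent draws given the parental genotypes.
            joint[(g1, g2)] += mating * child[g1] * child[g2]
    return joint


joint = sib_pair_joint(0.1)
```

In this sketch the marginal distribution of each sib remains Hardy-Weinberg, while the joint probabilities differ from the product of the marginals, as the text states.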

Using the notation defined above, the effect $\mu_G$ of genotype G on the phenotype is $a - \mu$, $d - \mu$,
and $-a - \mu$ for genotypes $A_1A_1$, $A_1A_2$, and $A_2A_2$ respectively. The constant
$\mu = a(p_1 - p_2) + 2dp_1p_2$ ensures that the phenotype has zero mean. The ratio d/a, termed the
dominance ratio, is -1 for a recessive allele, +1 for a dominant allele, and 0 for an additive allele.

The phenotypic variance contributed by the genotype G can be partitioned into an additive
component $\sigma_A^2$ and a dominance component $\sigma_D^2$, with

$$\sigma_A^2 = 2pq[a - d(p-q)]^2,$$ and
$$\sigma_D^2 = 4p^2q^2d^2,$$

where $p = p_1$ and $q = p_2$.
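
As a check on the partition above, the following sketch evaluates the mean shift and the two variance components and compares their sum with the genotypic variance computed directly from the three genotype effects. The function names and the parameter values are illustrative assumptions.

```python
def locus_parameters(p, a, d):
    """Return (mu, sigma_A^2, sigma_D^2) for allele frequency p and effects a, d."""
    q = 1.0 - p
    mu = a * (p - q) + 2.0 * p * q * d
    sigma_a2 = 2.0 * p * q * (a - d * (p - q)) ** 2
    sigma_d2 = 4.0 * p * p * q * q * d * d
    return mu, sigma_a2, sigma_d2


def genotype_variance_direct(p, a, d):
    """Variance of the genotype effect (a, d, -a) under Hardy-Weinberg frequencies."""
    q = 1.0 - p
    freqs = (p * p, 2.0 * p * q, q * q)
    effects = (a, d, -a)
    mean = sum(f * e for f, e in zip(freqs, effects))
    return sum(f * (e - mean) ** 2 for f, e in zip(freqs, effects))


mu, sigma_a2, sigma_d2 = locus_parameters(p=0.1, a=0.25, d=0.1)
total = genotype_variance_direct(p=0.1, a=0.25, d=0.1)
print(mu, sigma_a2, sigma_d2, total)
```

For any p, a, and d, the additive and dominance components should add up to the directly computed variance, which is a convenient sanity check when transcribing the formulas.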

In a population of unrelated individuals, the distribution $f[X]$ of trait values is a mixture of 3
univariate normals, one for each genotype:

$$f[X] = \sum_G f[X|G]\,P(G),$$ with
$$f[X|G] = (2\pi\sigma_R^2)^{-1/2}\exp[-(X - \mu_G)^2/2\sigma_R^2]$$

and the residual variance $\sigma_R^2 = 1 - \sigma_A^2 - \sigma_D^2$.
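
The three-component mixture can be written out directly. The sketch below is an illustration with hypothetical function names and parameter values; it builds the genotype weights and means from p, a, and d and evaluates the density at a given phenotypic value.

```python
import math


def genotype_table(p, a, d):
    """Return [(P(G), mu_G)] for A1A1, A1A2, A2A2 and the residual variance."""
    q = 1.0 - p
    mu = a * (p - q) + 2.0 * p * q * d
    sigma_a2 = 2.0 * p * q * (a - d * (p - q)) ** 2
    sigma_d2 = 4.0 * p * p * q * q * d * d
    table = [(p * p, a - mu), (2.0 * p * q, d - mu), (q * q, -a - mu)]
    return table, 1.0 - sigma_a2 - sigma_d2


def phenotype_density(x, p, a, d):
    """Mixture density f[X] = sum over G of f[X|G] P(G)."""
    table, sigma_r2 = genotype_table(p, a, d)
    norm = 1.0 / math.sqrt(2.0 * math.pi * sigma_r2)
    return sum(w * norm * math.exp(-(x - m) ** 2 / (2.0 * sigma_r2))
               for w, m in table)


table, sigma_r2 = genotype_table(p=0.1, a=0.25, d=0.0)
density_at_zero = phenotype_density(0.0, p=0.1, a=0.25, d=0.0)
```

With these definitions the mixture has zero mean and unit variance, matching the standardization of X assumed in the model.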

Similarly, in a population of sib pairs, the bivariate distribution of trait values $f[X_1,X_2]$ is a
mixture of 9 bivariate normals, appropriately weighted according to genotype combination:

$$f[X_1,X_2] = \sum_{G_1,G_2} f[X_1,X_2|G_1,G_2]\,P[G_1,G_2].$$

The mean of $X_j$ is $\mu_{G_j}$ for sib j = 1 or 2; both $X_1$ and $X_2$ have residual variance
$\sigma_R^2 = 1 - \sigma_A^2 - \sigma_D^2$; and $X_1$ and $X_2$ have correlation $r_R$ due to shared residual polygenic effects and
environmental factors. The total correlation r between sib pairs, including effects from
genotype G, is

$$r = r_R + \sigma_A^2/2 + \sigma_D^2/4.$$

It is convenient to re-express the phenotypes of sib pairs in terms of $X_+$ and $X_-$, defined as the
linear combinations $X_\pm = (X_1 \pm X_2)/2$, because these components are uncorrelated and the
probability distribution $f[X_+,X_-|G_1,G_2]$ factors into the product $f[X_+|G_1,G_2]\,f[X_-|G_1,G_2]$. The
individual probability distributions for $X_+$ and $X_-$ are

$$f[X_\pm|G_1,G_2] = (2\pi\sigma_\pm^2)^{-1/2}\exp[-(X_\pm - \mu_\pm)^2/2\sigma_\pm^2],$$ with
$$\mu_\pm(G_1,G_2) = (\mu_{G_1} \pm \mu_{G_2})/2$$ and
$$\sigma_\pm^2 = \sigma_R^2(1 \pm r_R)/2.$$

Allele frequencies $p_\pm$ are similarly defined as $p_\pm(G_1,G_2) = (p_{G_1} \pm p_{G_2})/2$.

2.2 Test Statistic and the Null Hypothesis 

We consider tests in which an upper and lower pool, each containing n individuals, are
selected according to higher and lower phenotypic values from a larger population of N
individuals. The frequencies $p_U$ and $p_L$ of allele $A_1$ are calculated for the upper and lower
pools, and the frequency difference is converted to the test statistic T,

$$T = \frac{p_U - p_L}{(\sigma_0^2/n)^{1/2}}.$$

The variance of $p_U - p_L$ under the null hypothesis that genotype G has no effect on phenotype X
is $\mathrm{Var}(p_U - p_L) = \sigma_0^2/n$. When the null hypothesis is valid and n is large, T follows a standard
normal distribution and $\sigma_0$ is independent of n.
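
A minimal sketch of the test follows, assuming the statistic is the pool frequency difference divided by its null standard deviation $(\sigma_0^2/n)^{1/2}$ and that significance is judged one-sided against $z_\alpha$. The pool frequencies, pool size, and null variance in the example are hypothetical values, not results from the specification.

```python
import math
from statistics import NormalDist


def pooled_test_statistic(p_upper, p_lower, n, sigma0_sq):
    """T = (p_U - p_L) / sqrt(sigma_0^2 / n), standard normal under the null."""
    return (p_upper - p_lower) / math.sqrt(sigma0_sq / n)


def is_significant(t, alpha):
    """One-sided test: significant when T exceeds the normal deviate z_alpha."""
    z_alpha = -NormalDist().inv_cdf(alpha)
    return t > z_alpha


# Hypothetical pools of n = 1000 unrelated individuals, population frequency 0.1,
# so that sigma_0^2 = 0.1 * 0.9 for the unrelated design.
t = pooled_test_statistic(p_upper=0.13, p_lower=0.07, n=1000, sigma0_sq=0.09)
print(t, is_significant(t, alpha=5e-8))
```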

The value of $\sigma_0$ depends on the population allele frequencies and also on the method used to
select the n individuals for each pool. Specifically, let $n_C$ be the total number of sib pairs
selected for the same pool and $n_D$ be the number split between pools, with the remaining
$2(n - n_C - n_D)$ individuals unrelated. The contribution of the unrelated individuals to
$\mathrm{Var}(p_U - p_L)$ is $[2(n - n_C - n_D)/n^2]\,\mathrm{Var}(p_G)$, and the individual variance is

$$\mathrm{Var}(p_G) = p_1^2(1) + 2p_1p_2(1/4) - p_1^2 = p_1p_2/2.$$

The contribution of the pooled-together sib-pairs is

$$[n_C/n^2]\,\mathrm{Var}(p_{G_1} + p_{G_2}) = [n_C/n^2][2\mathrm{Var}(p_G) + 2\mathrm{Cov}(p_{G_1},p_{G_2})] = (n_C/n^2)(3p_1p_2/2)$$

because the covariance between genotypes in a sib-pair is half the individual variance,
reflecting that sibs share half their genetic material. Similarly, the contribution of the
pooled-apart sib-pairs is

$$[n_D/n^2]\,\mathrm{Var}(p_{G_1} - p_{G_2}) = [n_D/n^2][2\mathrm{Var}(p_G) - 2\mathrm{Cov}(p_{G_1},p_{G_2})].$$
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The result for cto 2 is 

CT 0 2 = [1 + («c/2«) - (nTJ2n)]pxp2 , 

with important limitiag cases of p\p%/2 for pure sib-apart pooling, p\p2 for pure unrelated 
pooling, and 3p\p 2 /2 for pure sib-together pooling. 
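
The expression for the null variance term can be checked numerically against its three limiting cases. The following is a minimal sketch rather than part of the original analysis; the function name `sigma0_squared` and its argument names are hypothetical.

```python
def sigma0_squared(p1, n, n_c=0, n_d=0):
    """Null-hypothesis variance term [1 + n_C/(2n) - n_D/(2n)] * p1 * p2.

    p1  : population frequency of allele A1 (p2 = 1 - p1)
    n   : number of individuals in each pool
    n_c : total number of sib pairs placed together in the same pool
    n_d : number of sib pairs split between the upper and lower pools
    """
    if not 0.0 <= p1 <= 1.0:
        raise ValueError("p1 must lie in [0, 1]")
    if n <= 0 or n_c < 0 or n_d < 0 or n_c + n_d > n:
        raise ValueError("need n > 0 and 0 <= n_c + n_d <= n")
    p2 = 1.0 - p1
    return (1.0 + n_c / (2.0 * n) - n_d / (2.0 * n)) * p1 * p2
```

With `n_c = n_d = 0` the function returns the unrelated value, with `n_d = n` the sib-apart value, and with `n_c = n` the sib-together value.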

The allele frequency $p_1$ may be determined from the entire population. It is also possible to
estimate $p_1$ as the mean $(p_U + p_L)/2$, which is closer to 0.5 than the population mean $p_1$ in the
case of true association. The resulting $\sigma_0$ is larger, and using the mean results in a
conservative test.

2.3 Pooling Design 



A pooling design is a set of rules to determine which sibs are selected for the upper and lower
pools. For an unrelated population, these rules take the form of a pair of indicator functions
$I_U(X)$ for the upper pool and $I_L(X)$ for the lower pool. Each function takes the value 1 if an
individual is selected for the specified pool and is 0 otherwise. In general, individuals are
selected for at most one pool and $I_U + I_L$ is either 0 or 1.

The rules for sib-pairs may be formulated in terms of four indicator functions which depend on
both sibling phenotypic values $X_1$ and $X_2$. These indicator functions may be written $I_{Sj}(X_1, X_2)$
or equivalently $I_{Sj}(X_+, X_-)$, where the side S is U or L and j = 1 or 2 labels the sibling. The
indicator function has value 1 if sib j is selected for side S and is 0 otherwise. As before, each
individual is selected for at most one pool and $I_{Uj} + I_{Lj}$ is either 0 or 1.

A summary of pooling designs in terms of the indicator functions is provided in Table II. The
indicator functions are specified by upper and lower phenotype thresholds $X_U$ and $X_L$ and the
Heaviside step function H(x),

$$H(x) = \begin{cases} 1, & x > 0; \\ 1/2, & x = 0; \\ 0, & x < 0. \end{cases}$$

The values of $X_U$ and $X_L$ are defined implicitly by the requirement that the upper pool and
lower pool each contains a fraction $\rho$ of the total population.

Three types of designs are considered: unrelated pooling designs, in which none of the 2n
pooled individuals are related (although the individuals may be drawn from a larger population
of related individuals); sib-together pooling designs, in which each pool consists of n/2 sib
pairs; and sib-apart pooling designs, in which n sib pairs are split between the upper and lower
pools.

Unrelated Pooling Designs 

Two types of unrelated pools are shown. The first, unrelated-random, pools the n individuals
with the highest and lowest phenotypic values from a population of N unrelated individuals.
The term random arises because the N unrelated individuals may be obtained by selecting one
sib at random from an initial population of N sib pairs.

The second unrelated design, unrelated-extreme, first reduces a population of N/2 sib pairs to
N/2 unrelated individuals by selecting the individual with the more extreme phenotypic value
from each sib pair. Tails with n individuals are then selected for pooling from this unrelated
sub-population. The more extreme sib is defined as having a greater distance $|X_j|$ from the
phenotype mean. Other definitions of distance, such as the distance from the phenotype
median, or non-parametric definitions, such as the phenotype percentile score, are also
possible and yield similar results for a normal distribution of phenotype scores.
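
The unrelated-extreme selection rule can be sketched in a few lines. This is an illustrative sketch only: the function name `unrelated_extreme_pools`, the representation of each sib pair as a tuple of phenotype values already centered on the phenotype mean, and the tie-breaking rule (the first sib is kept when both are equally extreme) are assumptions that the text does not specify.

```python
def unrelated_extreme_pools(pairs, n):
    """Return (upper, lower) pools of n phenotype values each.

    pairs : list of (x1, x2) sib-pair phenotypes, centered so the mean is 0
    n     : number of individuals per pool
    """
    # Step 1: keep the sib with the greater distance |x| from the mean.
    extreme = [x1 if abs(x1) >= abs(x2) else x2 for x1, x2 in pairs]
    if n <= 0 or 2 * n > len(extreme):
        raise ValueError("need 0 < 2n <= number of sib pairs")
    # Step 2: pool the n lowest and n highest of the retained individuals.
    ranked = sorted(extreme)
    lower = ranked[:n]
    upper = ranked[-n:]
    return upper, lower
```

For example, `unrelated_extreme_pools([(0.1, -2.0), (1.5, 0.3), (-0.2, 0.4), (3.0, -1.0)], 1)` retains the values -2.0, 1.5, 0.4 and 3.0, and pools 3.0 as the upper tail and -2.0 as the lower tail.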

Sib-Together Pooling Designs

Two sib-together designs are analyzed, each starting with a population of N individuals in N/2
sib pairs. The first, termed concordant, is analogous to concordant pooling based on a
qualitative, affected/unaffected classification. If both sibs have phenotypic values above an
upper threshold $X_U$, the pair is selected for the upper pool; if both values are below a lower
threshold $X_L$, the pair is selected for the lower pool. The thresholds are adjusted until n/2 pairs
have been added to each pool. The second sib-together design, pair-mean, is based on the
phenotype mean $X_+$ for each pair: if the mean is above $X_U$, the pair is selected for the upper
pool; if it is below $X_L$, the pair is selected for the lower pool.

Sib-Apart Pooling Designs

Two sib-apart designs are also analyzed, each starting with N/2 sib pairs. The first is termed
discordant, again analogous to qualitative discordant pooling. If one sib in a pair has a
phenotypic value above an upper threshold $X_U$ and the other has a value below a lower
threshold $X_L$, the sib with the higher value is selected for the upper pool and the sib with the
lower value is selected for the lower pool. The thresholds $X_U$ and $X_L$ must have an additional
constraint in order to arrive at a unique solution. The constraint used here is that the
thresholds straddle the phenotype mean and are equidistant from it. Other constraints, such as
thresholds at equal percentiles away from the median phenotype, are possible but give similar
results for a normal distribution of phenotype scores.

The second sib-apart design, termed pair-difference, selects the n sib pairs with the greatest
magnitude of difference $|X_1 - X_2|$ in phenotypic values. The sib with the higher value is
selected for the upper pool and the sib with the lower value enters the lower pool. Again,
more general measures of distance are possible.
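
The pair-difference rule translates directly into a ranking of sib pairs by the magnitude of their phenotype difference. The sketch below is illustrative only; the function name `pair_difference_pools` and the tuple representation of sib pairs are assumptions, and ties in the ranking are left in input order.

```python
def pair_difference_pools(pairs, n):
    """Return (upper, lower) pools built from the n most discordant sib pairs.

    pairs : list of (x1, x2) sib-pair phenotype values
    n     : number of sib pairs to select (one sib goes to each pool)
    """
    if n <= 0 or n > len(pairs):
        raise ValueError("need 0 < n <= number of sib pairs")
    # Rank pairs by |x1 - x2|, largest difference first, and keep the top n.
    selected = sorted(pairs, key=lambda pair: abs(pair[0] - pair[1]), reverse=True)[:n]
    # Split each selected pair: higher value to the upper pool, lower to the lower pool.
    upper = [max(pair) for pair in selected]
    lower = [min(pair) for pair in selected]
    return upper, lower
```

Because every selected pair contributes exactly one sib to each pool, the two pools are matched pair by pair, which is the property the text later cites as protection against population stratification.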

The depiction of pooling designs in Fig. 1 complements the mathematical description. Each of
the six panels displays one of the pooling designs identified above. The coordinate axes are $X_1$
and $X_2$, the sib-pair phenotypic values, and cross at the overall phenotype mean of 0. Areas in
the graph are shaded when one or more of the indicator functions is 1. In the unrelated-random
design at the upper left, for example, an unrelated population is generated by taking
the first sib from each pair and the pooled regions are vertical half-planes. If the second sib
had been taken from each pair, the half-planes would be horizontal. The panel in the upper
right depicts the unrelated-extreme pools. The regions corresponding to sib 1 being extreme
are the two triangles bordered by $X_1 = \pm X_2$ and along the horizontal axis. These regions are
truncated at the upper threshold $X_U$ and the lower threshold $X_L$ to yield the contribution of sib
1 to the upper and lower pools. Sib 2 makes similar contributions, symmetric across the
$X_1 = X_2$ axis. This panel shows an example where $X_U \neq -X_L$, which is the general case when the
phenotype mean and median do not coincide. When equality holds, the excluded region in the
center is perfectly square.

The middle panels depict the two sib-together designs. On the left is the concordant design: to
be selected for pooling, both sibs must be above or below a threshold. The upper threshold $X_U$
could also provide the definition for a qualitative classification affected/unaffected. In this
case, the vertex of the lower pool moves northeast to meet the vertex of the upper pool at the
phenotypic values $(X_U, X_U)$. The panel to the right shows the pair-mean design. Here, sib pairs
are selected if their mean $X_+$ exceeds an upper threshold $X_U$ or falls below a lower threshold
$X_L$. The orthogonal coordinate $X_-$ is uncorrelated with $X_+$ and unconstrained in this design.
Note that the boundary lines $X_+ = X_U$ and $X_+ = X_L$ have intercepts $2X_U$ and $2X_L$ in the $X_1$-$X_2$
coordinate system.

The bottom panels depict the discordant design on the left and the pair-difference design on
the right. The discordant design selects sib-pairs from rectangular regions in the upper left and
lower right; the pooling boundaries in the pair-difference design are lines of constant $X_-$, with
$X_+$ unconstrained.

Despite the close analogy, there is an important difference between the concordant and
discordant designs described here for quantitative traits and the designs described elsewhere
for qualitative traits (Risch and Teng, 1998). In this formulation for quantitative traits, the
upper and lower thresholds define tails of a population distribution and a sizeable population
fraction falls between the tails. In a typical formulation for qualitative traits, and especially for
qualitative traits without an obvious quantitative basis, a single threshold divides the
population into two classes: a smaller affected class and a larger unaffected class holding most
of the population. In the terminology used here, such designs have $X_U = X_L$.

2.4 Distribution of $p_U - p_L$ under the Alternative Hypothesis

The fraction $\rho_S$ of the total population selected for each pool may be written

$\rho_S = \sum_{G_1, G_2} \sum_{j=1,2} \rho_{Sj}(G_1, G_2)$,

where, as before, S = U or L labels the upper or lower pool and

$\rho_{Sj}(G_1, G_2) = (1/2)\, P(G_1, G_2) \int_{-\infty}^{\infty} dX_+ \int_{-\infty}^{\infty} dX_- \, f[X_+ \mid G_1 G_2]\, f[X_- \mid G_1 G_2]\, I_{Sj}(X_+, X_-)$.

The initial factor of (1/2) arises because the phenotype and genotype distributions are
normalized to 1 per sib-pair rather than 2. In practice, the upper and lower thresholds $X_U$ and
$X_L$ are adjusted until the fraction in each pool is $\rho < 1$. For an unrelated population or for a
sib-pair population pooled with the pair-mean or pair-difference design, the largest possible $\rho$
is 0.5 and the entire population splits evenly into two pools. The concordant and discordant
designs have a maximum $\rho$ that is smaller than 0.5 because, as can be seen from Fig. 1, these
designs always exclude quadrants of the total population. For a sib-pair population with the
unrelated-extreme design, the largest possible $\rho$ is 0.25.

For feasible values of $\rho$, the expected allele frequency in pool S is

$p_S = \rho^{-1} \sum_{G_1, G_2} \sum_j \rho_{Sj}(G_1, G_2)\, p_{G_j}$,

where $p_{G_j}$ is the allele frequency of the j-th sib of the pair and the expected number of such sibs
selected for the pool is $n \rho^{-1} \rho_{Sj}(G_1, G_2)$. These numbers follow a multinomial distribution, with
the following general properties: when a random variable $x = n^{-1} \sum_i n_i x_i$ with the index i
ranging over a discrete set of sub-populations, the total number of samples $n = \sum_i n_i$ fixed, $x_i$
fixed for all samples from sub-population i, the expectation values $n_i/n = f_i$ fixed, and $\sum_i f_i = 1$,
then the expectation value of x is $\sum_i f_i x_i$ and its variance is
$\mathrm{Var}(x) = n^{-1} \{ \sum_i f_i x_i^2 - (\sum_i f_i x_i)^2 \}$ (Beyer, 1984). Using these results for a
multinomial distribution, the variance of the test statistic under the alternative hypothesis is written

$\mathrm{Var}(p_U - p_L) = \sigma_1^2 / n$,

where $\sigma_1$ is independent of the number of individuals n per pool.
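
The multinomial mean and variance quoted above are straightforward to evaluate for a given set of sub-population fractions. The sketch below is illustrative; the function name `pooled_mean_and_variance` and its arguments are hypothetical, and it simply evaluates the two expressions attributed to Beyer (1984).

```python
def pooled_mean_and_variance(fractions, values, n):
    """Mean and variance of x = (1/n) * sum_i n_i x_i for multinomial counts n_i.

    fractions : expected fractions f_i of the n samples (must sum to 1)
    values    : the fixed value x_i carried by each sub-population
    n         : total number of samples
    """
    if len(fractions) != len(values):
        raise ValueError("fractions and values must have the same length")
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    if n <= 0:
        raise ValueError("n must be positive")
    mean = sum(f * x for f, x in zip(fractions, values))
    second_moment = sum(f * x * x for f, x in zip(fractions, values))
    return mean, (second_moment - mean * mean) / n
```

As a consistency check, a single unselected individual drawn from Hardy-Weinberg genotype fractions (0.81, 0.18, 0.01 for an allele frequency of 0.1) with individual allele frequencies 0, 1/2 and 1 gives a mean of 0.1 and, for n = 1, a variance of 0.045, which equals $p_1 p_2 / 2$.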

For the unrelated-extreme design, $p_U$ and $p_L$ are independent multinomial distributions and

$\sigma_1^2 = \rho^{-1} \sum_{G_1, G_2} \sum_j \{ \rho_{Uj}(G_1, G_2)(p_{G_j}^2 - p_U^2) + \rho_{Lj}(G_1, G_2)(p_{G_j}^2 - p_L^2) \}$.

For the unrelated-random design, the index j is irrelevant, yielding simpler expressions:

$p_S = \rho^{-1} \sum_G \rho_S(G)\, p_G$, and

$\sigma_1^2 = \rho^{-1} \sum_G \{ \rho_U(G)(p_G^2 - p_U^2) + \rho_L(G)(p_G^2 - p_L^2) \}$.

For the sib-together designs, $I_{S1} = I_{S2}$ and the expected allele frequencies are

$p_S = \rho^{-1} \sum_{G_1, G_2} 2\rho_{S1}(G_1, G_2)\, p_+$.

The corresponding frequencies $f_i$ for the multinomial distribution are $2\rho^{-1}\rho_{S1}(G_1, G_2)$ and the
effective number of samples is n/2. The resulting variance term is

$\sigma_1^2 = 2\rho^{-1} \sum_{G_1, G_2} \{ 2\rho_{U1}(G_1, G_2)(p_+^2 - p_U^2) + 2\rho_{L1}(G_1, G_2)(p_+^2 - p_L^2) \}$.

For the sib-apart designs, $I_{U1} = I_{L2}$ and $I_{U2} = I_{L1}$. The expectation value of the allele frequency
difference is

$p_U - p_L = \rho^{-1} \sum_{G_1, G_2} [\rho_{U1}\, p_{G_1} + \rho_{U2}\, p_{G_2} - \rho_{L1}\, p_{G_1} - \rho_{L2}\, p_{G_2}] = \rho^{-1} \sum_{G_1, G_2} 2\rho_{U1}\, p_- + \rho^{-1} \sum_{G_1, G_2} 2\rho_{L1}\,(-p_-)$.

Due to the symmetry between the two siblings, $\rho^{-1} \sum_{G_1, G_2} 2\rho_{U1} = \rho^{-1} \sum_{G_1, G_2} 2\rho_{L1} = 1$ and $p_U - p_L$
is the sum of two multinomial distributions each with expectation value $(p_U - p_L)/2$. The
effective number of samples for each distribution is n/2, and the variance term is

$\sigma_1^2 = 2\rho^{-1} \sum_{G_1, G_2} 2(\rho_{U1} + \rho_{L1}) \{ p_-^2 - [(p_U - p_L)/2]^2 \}$.

When the null hypothesis is valid, each of these expressions for $\sigma_1$ reduces to the
corresponding expression for $\sigma_0$. If the alternative hypothesis is valid, $\sigma_1$ is smaller than $\sigma_0$ to
the extent that variance in the test statistic is explained by the pooling design. Nevertheless, in
most cases $\sigma_0$ is an excellent approximation.

2.5 Power 

The statistical power $1 - \beta$ to reject the null hypothesis for a single one-tailed test with p-value
$\alpha$, where $\alpha$ is equivalent to the false-positive rate or Type I error rate and $\beta$ is equivalent to the
false-negative rate or Type II error rate, is

$1 - \beta = 1 - \Phi\{ [z_\alpha \sigma_0 - \sqrt{n}\,(p_U - p_L)] / \sigma_1 \}$,

where $\Phi(z)$ is the cumulative standard normal distribution, $1 - \Phi(z_\alpha) = \alpha$. Solving for n and
using the relation $n/N = \rho$, the total number of individuals N necessary to generate pools with
the required power is

$N = \rho^{-1} [ (z_\alpha \sigma_0 - z_{1-\beta} \sigma_1) / (p_U - p_L) ]^2$,

where $\rho = n/N$ is the fraction of the total population selected for each pool. In either case,
replacing $\sigma_1$ with $\sigma_0$ would result in a conservative test.
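
The two expressions above can be evaluated directly once the frequency difference and the variance terms are known. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the normal quantiles are passed in as arguments, and the numerical inputs used in the check below (a frequency difference of 0.05 and a common value of 0.3 for both standard deviation terms) are illustrative values, not results from the text.

```python
from math import sqrt
from statistics import NormalDist


def power(n, delta, sigma0, sigma1, z_alpha):
    """Power 1 - Phi{[z_alpha*sigma0 - sqrt(n)*delta] / sigma1}.

    n      : number of individuals per pool
    delta  : expected allele frequency difference p_U - p_L
    sigma0 : standard deviation term under the null hypothesis
    sigma1 : standard deviation term under the alternative hypothesis
    """
    return 1.0 - NormalDist().cdf((z_alpha * sigma0 - sqrt(n) * delta) / sigma1)


def required_population(delta, sigma0, sigma1, z_alpha, z_power, rho):
    """Total population N = (1/rho) * [(z_alpha*sigma0 - z_power*sigma1) / delta]**2.

    z_power is z_{1-beta}, which is negative when the power exceeds 0.5.
    rho is the fraction of the population selected for each pool.
    """
    if delta == 0 or not 0.0 < rho <= 0.5:
        raise ValueError("need delta != 0 and 0 < rho <= 0.5")
    n = ((z_alpha * sigma0 - z_power * sigma1) / delta) ** 2
    return n / rho
```

Feeding the pool size `rho * N` returned by `required_population` back into `power` recovers the power implied by `z_power`, which is a convenient self-check on the algebra.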

2.6 Computational Methods

Exact results for the distribution of the test statistic T under the null hypothesis and under the
alternative hypothesis, subject only to the approximation that T is normal, were obtained by
numerical computations converged to better than 1 part in $10^6$ (Press et al. 1997). Brent's
root-finding algorithm was used to determine the threshold values $X_U$ and $X_L$ for the upper and
lower pools for a given pooling design and pooling fraction $\rho$; Brent's optimization algorithm
was then used to find the $\rho$ with maximum power. The integrals providing $p_U - p_L$ and $\sigma_1^2$
were evaluated numerically using Romberg integration with a change of variables to the
reciprocal for infinite integration limits. Integration was restricted to regions where an
indicator function was non-zero. In order to reduce computational requirements, the final
integral of a normal distribution over fixed limits was evaluated using a polynomial
approximation to the error function. This technique reduced the two-dimensional integrals
over bivariate normals to one-dimensional integrals for the unrelated-extreme, concordant, and
discordant designs, while integration was avoided completely for the unrelated-random,
pair-mean, and pair-difference designs. The 9 sib-pair genotypes were reduced by symmetry to 5
genotypes for further savings. Using a 750 MHz Pentium III running Linux, the root-finding
and minimization for each parameter set required less than 0.01 sec each for the
unrelated-random, pair-mean, and pair-difference designs and approximately 6 sec each for the
unrelated-extreme, concordant, and discordant designs.
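
The threshold search can be illustrated for the simplest case. The sketch below is not the implementation described above: it substitutes plain bisection for Brent's algorithm, and it assumes an unrelated population whose phenotype follows a standard normal distribution, so that the fraction above a threshold is one minus the normal cumulative distribution. The function names are hypothetical.

```python
from statistics import NormalDist


def upper_tail_fraction(threshold):
    """Fraction of a standard normal phenotype distribution above the threshold."""
    return 1.0 - NormalDist().cdf(threshold)


def find_upper_threshold(rho, lo=-10.0, hi=10.0, tol=1e-10):
    """Bisection search for the threshold whose upper tail holds a fraction rho."""
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    # The tail fraction decreases monotonically as the threshold increases,
    # so the root of upper_tail_fraction(x) - rho is bracketed by [lo, hi].
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if upper_tail_fraction(mid) > rho:
            lo = mid  # tail still too large: move the threshold up
        else:
            hi = mid  # tail too small: move the threshold down
    return 0.5 * (lo + hi)
```

For a symmetric phenotype distribution the lower threshold is the negative of the value returned. Designs whose pooled regions depend on both sibs would replace `upper_tail_fraction` with the corresponding integral over the indicator function.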

The numerical results, and the underlying theory, are robust when n, the number of individuals
per pool, is large and $2(p_U + p_L)n$, the number of alleles in the pools, approximately follows a
normal distribution. In certain regions of extreme parameter values, however, the numerical
solution for n drops below 1. This behavior signals a breakdown of various assumptions of the
theory, and results in these regions are unreliable.

The properties and characteristics of the methods of the present invention are set forth in the 
Examples. It is shown, for example, that the optimal design for unrelated individuals is to 
pool the top and bottom 27% of the population. This design using N unrelated individuals has 
greater power than designs using N/2 sib pairs when the phenotypic correlation between sibs is 
low to moderate, below 75%, but has less power than sib pair designs when the correlation is 
above 75%. 

Of the designs explored for a population of sib pairs, the unrelated-extreme design is the best 
for low to moderate sibling phenotype correlation. In this design, the more extreme sib is 
selected from each pair, then the top and bottom 36% of this subset are pooled. When the 
correlation is high, above 75%, the best design found for sib pairs is to first select the 27% of
pairs with the greatest phenotype difference, then split each pair by phenotypic value to form
an upper and lower pool. The pair-difference design might also be applied at low to moderate
sibling correlation to reduce the rate of spurious association due to population stratification.
The optimal pooling fractions for these designs were determined by minimizing the population
requirements. The minima were generally quite flat, and pooling fractions close to the optimal
fractions give near-optimal results.

Compared with the results obtained by others for pooling based on qualitative traits, the results
derived using the methods of the present invention for quantitative traits are thought to be
surprising. For earlier pooling strategies based on qualitative traits, designs using unrelated
individuals were found to be more powerful than designs using sib pairs; when populations
were restricted to sib pairs, concordant designs were found to have greater power than
discordant designs (Risch and Teng 1998). In contrast, for quantitative phenotypes, the
methods of the present invention indicate that unrelated individuals become less powerful than
sib pairs when sibling correlation is high, and that sib-apart designs become more powerful
than sib-together designs when the sibling correlation is above 50%. This result is significant
because highly heritable traits that are likely to be the first targets of large-scale genotyping
studies often exhibit sibling correlations of 50% or higher. Quantitative phenotypic values
also permit the use of the unrelated-extreme design, which does not have an obvious analog
for qualitative phenotypes that categorize individuals as affected/unaffected.

The sib-together and sib-apart pooling designs of the present invention, which draw
individuals from extreme-high and extreme-low phenotypes, are anticipated to be more
powerful than alternative designs that compare one extreme to the remainder of the
population, as in a qualitative affected/unaffected classification. The affected/unaffected
classification establishes a single threshold for a quantitative phenotype, and the allele
frequency in the large unaffected class is close to the population mean. In contrast, the
quantitative designs of the present invention employ two thresholds, and the allele frequencies
in the upper and lower pools are approximately equidistant from the population mean. The
allele frequency difference between pools is consequently half as large for the qualitative
design as for the quantitative design of the present invention, and the population requirements
are four times as large, or half as large if the overall allele frequency is assumed to be known
exactly. These conclusions are similar to those reached in the context of linkage analysis for
quantitative trait localization using extremely concordant and extremely discordant sib pairs
(Risch and Zhang 1995, Risch and Zhang 1996, Zhang and Risch 1996, Gu et al. 1996).

As with most genotyping designs, the pooling strategies described here are primarily sensitive
to the additive variance from an allele. Since the additive variance for an allele is
approximately equal to the fraction of heterozygotes times the square of half the phenotype
shift between the two homozygotes, rare alleles with larger phenotype shifts may be detected
with the same power as common alleles with smaller shifts. When the allele frequency
becomes smaller than the additive variance of the allele, however, the frequency shift must
become very large to compensate and the phenotype begins to resemble a monogenic trait.



The results provided here also imply the precision required for allele frequency determinations for
pooled DNA. Approximately 3000 individuals are required for a genome-wide screen with an optimal
pool size n of 600 to 800 individuals. The frequency difference corresponding to significance at
$\alpha = 5 \times 10^{-8}$ ($z_\alpha = 5.33$) for a polymorphism with minor-allele frequency $p_1$ is
$z_\alpha [p_1 (1 - p_1)/n]^{1/2}$, which is approximately
5% for an allele frequency of 0.1 and 2% for an allele frequency of 0.01. An experimental
measurement should provide an order of magnitude better precision in the allele frequency difference
to avoid losing information.
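
The frequency difference at the significance threshold is simple to tabulate for other pool sizes. The sketch below is illustrative; the function name is hypothetical and the default quantile is the value 5.33 quoted in the text.

```python
from math import sqrt


def significant_frequency_difference(p1, n, z_alpha=5.33):
    """Allele frequency difference z_alpha * sqrt(p1 * (1 - p1) / n).

    p1 : minor-allele frequency
    n  : number of individuals per pool
    """
    if not 0.0 <= p1 <= 1.0 or n <= 0:
        raise ValueError("need 0 <= p1 <= 1 and n > 0")
    return z_alpha * sqrt(p1 * (1.0 - p1) / n)
```

With n = 800, taken here from the upper end of the stated range, the expression evaluates to about 0.057 for an allele frequency of 0.1 and about 0.019 for an allele frequency of 0.01, in line with the rounded figures of 5% and 2%.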
3. Examples for Mod 1 1
Overview to the Examples 

In this section, total population sizes are presented for a wide range of parameters and as
functions of the pooling fraction $\rho$. The first parameters explored are the sib-pair phenotype
correlation r and the allele frequency $p_1$; these parameters are readily determined
experimentally at the start of an association study. The next set of parameters explored are
the additive phenotype variance $\sigma_A^2$, the dominance ratio d/a, and the resulting dominance
variance $\sigma_D^2$ and genotype effects $\mu_G$, which are not known at the start of a study. Finally, the
dependence of the population requirements on the false-positive rate $\alpha$ and false-negative rate
$\beta$ is explored. As each single parameter is varied in turn, the remaining parameters are held
fixed at a set of values selected to serve as a common reference.

The reference value for sibling phenotype correlation was based on reported values for genetic
heritabilities and shared environmental factors. Estimates of the genetic heritability for
complex traits include 20% for cancer (Verkasalo et al. 1999), 20% to 40% for Type 2
diabetes mellitus (NIDDM) (Watanabe et al. 1999), 50% for pulmonary function (Wilk et al.
2000), 10% to 50% for systolic and diastolic blood pressure (Iselius et al. 1983, Perusse et al. 1989),
and 70% to 90% for cholesterol level (Austin et al. 1987). Shared environmental factors are
estimated to contribute 7% of the overall phenotype variance for cancer (Verkasalo et al.
1999), 20% to 40% for blood pressure (Iselius et al. 1983, Perusse et al. 1989), and 15% for
serum lipid levels (Heller et al. 1993). The sibling phenotype correlation, equal to half the
genetic heritability plus the shared environmental contribution, varies over a wide range for
these traits. A phenotype correlation of 40%, in the middle of the range, was selected to serve
as the reference.
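
The relation between heritability, shared environment, and sibling correlation stated above is a one-line calculation. The sketch below is illustrative only; the function name and argument names are hypothetical.

```python
def sibling_phenotype_correlation(heritability, shared_environment):
    """Sibling correlation = half the genetic heritability + shared environment.

    Both arguments are fractions of the total phenotypic variance (0 to 1).
    """
    if not 0.0 <= heritability <= 1.0 or not 0.0 <= shared_environment <= 1.0:
        raise ValueError("inputs must be fractions between 0 and 1")
    return 0.5 * heritability + shared_environment
```

Using the cancer estimates quoted above (20% heritability, 7% shared environment) gives a correlation of 0.17, while the blood pressure ranges (10% to 50% heritability, 20% to 40% shared environment) span 0.25 to 0.65, which brackets the 40% reference value.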

Reported minor-allele frequencies for SNPs found in multiple populations range from 5% to
25%, with lower frequencies for variations which cause non-conservative amino acid changes
and higher frequencies for conservative substitutions and changes in non-coding regions
(Cargill et al. 1999, Goddard et al. 2000). A reference value of 10% was selected for $p_1$,
typical of changes in the coding region.

The genetic variance arising from a typical SNP was modeled by assuming that the genetic
heritability arises from multiple loci, each of which makes an independent contribution with a
characteristic size equal to the genetic heritability divided by the total number of contributing
loci. Assuming that approximately 20 polymorphic sites contribute to a genetic heritability of
40% yields a reference value of 0.02 for $\sigma_A^2 + \sigma_D^2$. The reference value selected for the
dominance ratio was d/a = 0, indicating a purely additive allele.

In practice, the false-positive rate $\alpha$ is matched to the number of individual tests that are to be
conducted in an association study. For a genome scan of $10^6$ individual markers versus a
single phenotype, for example, or for a scan of $10^4$ markers versus 100 distinct phenotypes, the
false-positive rate $\alpha$ per marker should be no more than $5 \times 10^{-8}$ for a final p-value < 0.05 for
the detection of an association. If only 1000 markers are used, for example as in a test of
candidate polymorphisms, then the value $\alpha = 5 \times 10^{-5}$ suffices. The false-positive rate selected
as a reference was $\alpha = 5 \times 10^{-8}$ ($z_\alpha = 5.33$), a value suggested to provide a sufficiently low
number of false positives after applying a multiple-hypothesis-testing correction
corresponding to a full-genome scan (Risch and Merikangas 1996). The power $1 - \beta$ was fixed
at 0.8 ($z_{1-\beta} = -0.84$) for a 20% false-negative rate.
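
The per-marker thresholds and the corresponding normal quantiles quoted above can be reproduced with the Python standard library. This is a minimal sketch; the function names are hypothetical, and the division of the overall false-positive rate by the number of tests is the usual Bonferroni-style correction implied by the figures in the text.

```python
from statistics import NormalDist


def per_marker_alpha(overall_alpha, n_tests):
    """Per-test false-positive rate for a given overall rate and number of tests."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return overall_alpha / n_tests


def upper_tail_z(q):
    """Return z such that 1 - Phi(z) = q for the standard normal distribution."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    return NormalDist().inv_cdf(1.0 - q)
```

For example, `upper_tail_z(per_marker_alpha(0.05, 10**6))` rounds to 5.33, `upper_tail_z(per_marker_alpha(0.05, 1000))` rounds to 3.89, and `upper_tail_z(0.8)` rounds to -0.84, matching the reference values.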

Figures depicting the results use a consistent scheme. The unrelated designs are represented as
solid lines, thin for unrelated-random and thick for unrelated-extreme; the sib-together designs
are represented as equal-spaced dashed lines, thin for concordant and thick for pair-mean; and
the sib-apart designs are represented as unequally-spaced dashed lines, thin for discordant and
thick for pair-difference.

Example 1. Sibling Phenotype Correlation 

The minimum population size N required to detect association as a function of the sibling
phenotype correlation r and the pooled fraction $\rho$ is shown in Fig. 2, with the remaining
parameters at their previously defined reference values ($\alpha = 5 \times 10^{-8}$, $1 - \beta = 0.8$, $p_1 = 0.1$,
$\sigma_A^2 = 0.02$, d/a = 0). The three panels in Fig. 2 show a range of sibling phenotype
correlations: r = 0.1 (Panel A), 0.5 (Panel B), and 0.9 (Panel C). In each panel, as the pooling
fraction increases from $\rho = 0$, each design has a sharp then more gradual decrease in
population requirements. Eventually N attains a minimum, indicating the optimal pooling
fraction for maximum power, and then gradually increases with $\rho$. A second feature seen in all
three panels is the similarity between the unrelated designs, between the sib-together designs,
with pair-mean always more powerful than concordant, and between the sib-apart designs,
with pair-difference always more powerful than discordant. Furthermore, for larger values of
$\rho$ the required numbers of concordant and discordant sib pairs are not met.



In Fig. 2, Panel A shows that for small values of the phenotype correlation the design with the
greatest power is unrelated-random, with unrelated-extreme slightly less powerful. The
sib-together designs require approximately twice as large a sample, and the sib-apart designs
require three to four times as many. In Panel B, at the intermediate phenotype correlation
r = 0.5, the unrelated designs are still the most powerful, while the sib-together designs have
increased population requirements and the sib-apart designs have decreased to meet in the
middle. At large values of the sibling phenotype correlation, r = 0.9 in Panel C, the sib-apart
designs are most powerful. The unrelated designs require approximately twice as large a
population, and the sib-together designs have far greater requirements.

The regions near the minima of N for each design are quite flat, indicating that pooling
fractions within 0.1 of the minimum may give near-optimal results. The exact values of these
minima are depicted in Fig. 3. The population requirements are shown in Panel A, and the
corresponding optimal pooling fractions are shown in Panel B. The unrelated-random design
is insensitive to the sibling correlation r, as seen in Panel A, as is the unrelated-extreme design
except at the highest values of r. The sib-together designs require larger populations as r
increases, while the sib-apart designs require smaller populations. The sib-together and
sib-apart designs cross near r = 0.5, and the sib-apart and unrelated designs cross near r = 0.75.
The optimal pooling fractions are insensitive to the changes in the sibling correlation for
values below r = 0.75, as seen in Panel B. The unrelated-random, pair-mean, pair-difference,
and concordant designs have an optimal $\rho$ near 0.27 in this region of low to moderate
correlation, while the unrelated-extreme design has an optimum near $\rho = 0.18$ and the
discordant design near 0.23. For phenotypes with high correlation, r > 0.75, the optimal
pooling fraction $\rho$ decreases and only highly discordant sibs are selected for the sib-apart designs.

Example 2. Allele Frequency

The results of changing the allele frequency $p_1$ while optimizing the pooling fraction and
holding the remaining parameters constant at their reference values ($\alpha = 5 \times 10^{-8}$, $1 - \beta = 0.8$,
r = 0.4, $\sigma_A^2 = 0.02$, d/a = 0) are shown in Fig. 4. The population requirements corresponding
to the optimal pooling fraction $\rho$ are shown in Panel A, and the corresponding fractions $\rho$ in
Panel B. The dependence on $p_1$ is symmetric about $p_1 = 0.5$; results are shown only for the
region $p_1 < 0.5$ and are displayed on a logarithmic scale to highlight the behavior at low allele
frequency.

At moderate frequencies of the minor allele, $p_1 > 1\%$, the power and pooling fraction are both
largely insensitive to the allele frequency. This behavior, which arises when $\sigma_A$ is held constant and
changes in $\mu_G$ are allowed to compensate for changes in $p_1$, is often observed in variance
components models (Liu 1997). Thus, as long as the allele frequency is not too small, lower
frequency alleles with larger effects and higher frequency alleles with smaller effects are
found with similar power.

At smaller allele frequencies, $p_1 < 1\%$, the increasingly rare allele has a correspondingly large
effect $\mu_G$ on the phenotype, and the population requirements decrease. The crossover into this
region occurs when the allele frequency $p_1$ falls below its contribution $\sigma_A^2 + \sigma_D^2$ to the overall
phenotypic variance. The pooling fraction also decreases with $p_1$ in this region. The
exception to this trend is the discordant design, which has a dramatic drop in power for low
frequency alleles.

Example 3. Additive Allele Variance 

The population size N required to detect association is shown as Panel A in Fig. 5 as a function
of the additive phenotypic variance arising from genotype G, with the remaining parameters
fixed at their reference values ($\alpha = 5 \times 10^{-8}$, $1 - \beta = 0.8$, r = 0.4, $p_1 = 0.1$, d/a = 0). The
population size and the variance have a clear inverse linear relationship over three orders of
magnitude. This behavior corresponds to $N \propto (p_U - p_L)^{-2}$ with $p_U - p_L$ proportional in turn
to $\sigma_A$.

The corresponding optimal pooling fractions are shown in Fig. 5, Panel B. Over most of the
range, $\sigma_A^2 < 0.1$, the optimal fractions are not sensitive to the variance arising from the allele.
At larger values of the variance the phenotype becomes nearly monogenic and smaller pooling
fractions and populations are required.

Example 4. Recessive, Additive, and Dominant Alleles 

The series of panels in Fig. 6 depicts the required population size as a function of the pooling
fraction $\rho$ for a range of dominance ratios d/a. The values for d/a were selected to provide
adequate sampling of the ratio of the dominance variance to the additive variance. This
contribution, $\sigma_D^2 / (\sigma_A^2 + \sigma_D^2)$, is 82% at d/a = -1 (pure recessive), 65% at -0.9, 11% at -0.5,
and 5% at +1 (pure dominant). The remaining parameters were held at their reference values
($\alpha = 5 \times 10^{-8}$, $1 - \beta = 0.8$, r = 0.4, $p_1 = 0.1$, $\sigma_A^2 = 0.02$, d/a = 0). The pooling fraction was set to
$\rho = 0.2$ for this series of panels and represents a near-optimal fraction for additive variance,
d/a = 0.
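
The quoted dominance contributions can be reproduced from a standard single-locus variance decomposition. The sketch below rests on an assumption: it uses the textbook parameterization with genotypic values +a, d and -a for the three genotypes and allele frequencies $p_1$ and $p_2 = 1 - p_1$, which is not written out in this section but does reproduce the percentages above. The function names are hypothetical.

```python
def variance_components(p1, a, d):
    """Additive and dominance variance for genotypic values +a, d, -a.

    Assumed textbook forms:
      additive  = 2*p1*p2*(a + d*(p2 - p1))**2
      dominance = (2*p1*p2*d)**2
    """
    if not 0.0 <= p1 <= 1.0:
        raise ValueError("p1 must lie in [0, 1]")
    p2 = 1.0 - p1
    additive = 2.0 * p1 * p2 * (a + d * (p2 - p1)) ** 2
    dominance = (2.0 * p1 * p2 * d) ** 2
    return additive, dominance


def dominance_fraction(p1, d_over_a):
    """Dominance variance as a fraction of additive plus dominance variance."""
    additive, dominance = variance_components(p1, 1.0, d_over_a)
    return dominance / (additive + dominance)
```

For $p_1 = 0.1$ the fraction evaluates to 0.82, 0.65, 0.11 and 0.05 at d/a = -1, -0.9, -0.5 and +1. Under the same assumed parameterization the additive term is zero where a + d(p2 - p1) = 0, that is at d/a = -1.25 for $p_1 = 0.1$.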

For pure recessive traits, d/a = -1 in Panel A (82% dominance variance for $p_1 = 0.1$), the
estimate for N approaches an apparent minimum at $\rho = 0$, and the assumption of normality of
the test statistic is no longer valid. When d/a is -0.9 and the dominance variance contribution
has dropped to 65%, the curves for N in Panel B start to flatten, and when d/a = -0.5, in which
the heterozygote mean is still three-quarters of the way towards the minor-allele homozygote,
the curves in Panel C are nearly indistinguishable from the results for pure additive (not
depicted) and pure dominant, Panel D.

These results again signal that pooling methods for quantitative phenotypes are more sensitive
to changing additive variance than to changing dominance variance. The dominance variance
is only significant in regions where the additive variance vanishes, $d/a = 1/(p_1 - p_2)$. This
region occurs near -1 for a low-frequency allele, indicating that association studies have weak
power to detect low-frequency recessive alleles or their high-frequency dominant counterparts.

These effects are shown in greater detail in Fig. 7. The population requirements are shown as
a function of the dominance ratio $d/a$ at the fixed pooling fraction $\rho = 0.2$ in Panel A and
at the optimal fraction in Panel B. Other than the region in which the additive variance
vanishes, $d/a = 1/(p_1 - p_2) = -1.25$ for $p_1 = 0.1$, the results in both panels are similar
and show little dependence on $d/a$. This is true even in regions of strong over-dominance,
$d/a > 1$, and under-dominance, $d/a < -2$. Near the region of vanishing additive variance the
optimal pooling fraction $\rho$ drops rapidly, as seen in Panel C, and the results for the
optimal $\rho$ and $\rho = 0.2$ differ.

Example 5. False-Positive Rate and False-Negative Rate 

When the widths of the distribution of the test statistic under the null and alternative
hypothesis are approximately equal, the equation for the population necessary to detect
association has the form $N \propto (z_\alpha - z_{1-\beta})^2$. When $\alpha$ becomes small,
the behavior $z_\alpha \sim [-2\ln(\alpha)]^{1/2}$, extracted from an asymptotic expansion for
$\Phi(z)$ (Mathews and Walker 1970), leads to the asymptotic behavior $N \sim 2\ln(1/\alpha)$,
which is seen clearly as the linear behavior in Panel A of Fig. 8. The remaining parameters are
fixed at their reference values ($1-\beta = 0.8$, $r = 0.4$, $p_1 = 0.1$, $\sigma_A^2 = 0.02$,
$d/a = 0$). Compared to a whole-genome scan with $\alpha = 5\times10^{-8}$ ($z_\alpha = 5.33$)
and a 20% false-negative rate, for example, which requires 2400 individuals pooled with the
unrelated-random design or 3000 siblings pooled with the unrelated-extreme design, a test of
1000 candidate polymorphisms with $\alpha = 5\times10^{-5}$ ($z_\alpha = 3.89$) requires 1400
unrelated individuals or 1800 siblings, while a test for association between a single
polymorphism and a single phenotype, $\alpha = 0.05$ ($z_\alpha = 1.64$), requires 400 unrelated
individuals or 500 siblings. The optimal fraction $\rho$ for pooling is not sensitive to the
choice for $\alpha$ itself, as seen in Panel B.
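
As a rough check on these figures, the normal deviates and the scaling of $N$ with $(z_\alpha - z_{1-\beta})^2$ can be evaluated directly. The sketch below assumes only that proportionality and takes the 2400-individual whole-genome case as the reference; the helper names are hypothetical, and `statistics.NormalDist` is from the Python standard library.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def z_upper(alpha):
    """Normal deviate z_alpha with Phi(z_alpha) = 1 - alpha (one-sided test)."""
    return STD.inv_cdf(1.0 - alpha)


def relative_size(alpha, beta=0.2, ref_alpha=5e-8):
    """N(alpha) / N(ref_alpha), assuming N is proportional to (z_alpha - z_{1-beta})^2."""
    z_beta = STD.inv_cdf(beta)  # z_{1-beta}, negative for beta < 0.5

    def scale(a):
        return (z_upper(a) - z_beta) ** 2

    return scale(alpha) / scale(ref_alpha)


for alpha in (5e-8, 5e-5, 0.05):
    print(alpha, round(z_upper(alpha), 2), round(2400 * relative_size(alpha)))
```

The scaled sizes land close to the 1400 and 400 unrelated individuals quoted in the text, as expected if the widths under both hypotheses are nearly equal.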

The effects of varying the false-negative rate $\beta$ are similar to the effects of varying
$\alpha$ because the population requirements depend predominantly on the difference
$z_\alpha - z_{1-\beta}$ rather than on the value of either alone. For small values of $\beta$,
$N \sim 2\ln(1/\beta)$. This linear behavior is demonstrated in Panel C, where the remaining
parameters have their reference values except for $\alpha = 5\times10^{-5}$ corresponding to a
test of candidate polymorphisms. The optimal pooling fraction $\rho$ does not depend sensitively
on $\beta$, as shown in Panel D.

4. Model 2 

4.1 Variance components model

A standard variance components model is used to describe the joint phenotype-genotype
probability distribution. A quantitative phenotype $X$, standardized to mean 0 and variance 1, is
hypothesized to be affected by the genotype $G$ at a biallelic locus with minor allele $A_1$ and
major allele $A_2$ occurring at population frequencies $p$ and $1-p$. More generally, $A_2$ may
represent any of a number of alternate alleles, and $1-p$ their aggregate frequency. The
population is assumed to be random mating and in Hardy-Weinberg equilibrium. The symbol $P$ is
used to denote a probability, and the genotype frequencies $P(G)$ are $p^2$, $2p(1-p)$, and
$(1-p)^2$ for $A_1A_1$, $A_1A_2$, and $A_2A_2$ respectively. The frequency of allele $A_1$ in
genotype $G$, denoted $p_G$, is 1 for $A_1A_1$, 0.5 for $A_1A_2$, and 0 for $A_2A_2$. The
variance of the allele frequency for an individual, denoted $\sigma_p^2$, is $p(1-p)/2$.

The frequency of a genotype combination for a sib pair is denoted $P(G_1,G_2)$. Only full sibs
are considered. The probability distribution $P(G_1,G_2)$ of the 9 possible combinations of
sib-pair genotypes, shown in Table HE, can be derived by considering all possible parental
mating types and their offspring genotype distributions (Neale, MC and Cardon, LR: Methodology
for Genetic Studies of Twins and Families; in NATO ASI Series D, Behavioural and Social
Sciences, vol 67. Dordrecht, Kluwer Academic, 1992).
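
The enumeration over parental mating types can be sketched as follows. This is an illustrative reconstruction under the stated assumptions (random mating, Hardy-Weinberg equilibrium, full sibs), not the tabulated values themselves; genotypes are coded by the number of $A_1$ alleles, and the function names are hypothetical.

```python
from itertools import product


def sib_pair_distribution(p):
    """P(G1, G2) for full sibs; genotypes are coded as the number of A1 alleles."""
    hwe = {2: p * p, 1: 2 * p * (1 - p), 0: (1 - p) ** 2}
    # Probability that a parent carrying g copies of A1 transmits A1 to a child.
    transmit = {2: 1.0, 1: 0.5, 0: 0.0}
    joint = {pair: 0.0 for pair in product((2, 1, 0), repeat=2)}
    for father, mother in product(hwe, repeat=2):
        mating = hwe[father] * hwe[mother]
        tf, tm = transmit[father], transmit[mother]
        child = {2: tf * tm,
                 1: tf * (1 - tm) + (1 - tf) * tm,
                 0: (1 - tf) * (1 - tm)}
        # Given the parental genotypes, the two sibs are independent draws.
        for g1, g2 in joint:
            joint[(g1, g2)] += mating * child[g1] * child[g2]
    return joint


def genotypic_correlation(joint, p):
    """Correlation of the allele frequencies p_G1 and p_G2 within a sib pair."""
    cov = sum(pr * (g1 / 2 - p) * (g2 / 2 - p) for (g1, g2), pr in joint.items())
    return cov / (p * (1 - p) / 2)


table = sib_pair_distribution(0.1)
print(sum(table.values()), genotypic_correlation(table, 0.1))
```

The nine probabilities sum to 1, each marginal recovers the Hardy-Weinberg frequencies, and the genotypic correlation comes out at the full-sib value of 0.5.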

The effects $\mu(G)$ of genotype $G$ are to displace the phenotypic mean by $a$, $d$, and $-a$
for genotypes $A_1A_1$, $A_1A_2$, and $A_2A_2$ respectively, with the raw mean
$(2p-1)a + 2p(1-p)d$ then subtracted to preserve the overall phenotypic mean of 0. The
relationship between $d$ and $a$ determines the inheritance mode of allele $A_1$: $d = -a$ for a
recessive allele, $d = +a$ for a dominant allele, and $d = 0$ for an additive allele.

The phenotypic variance contributed by the genotype $G$ can be partitioned into an additive
component $\sigma_A^2$ and a dominance component $\sigma_D^2$, with
$$\sigma_A^2 + \sigma_D^2 = 2p(1-p)[a - d(2p-1)]^2 + 4p^2(1-p)^2d^2.$$

As will be seen below, this partitioning is important because association tests are sensitive
primarily to $\sigma_A^2$, not to $\sigma_D^2$. Note that $\sigma_A^2$ may be much larger than
$\sigma_D^2$ even when the inheritance is purely dominant or recessive. Remaining genetic and
environmental factors contribute a residual variance $\sigma_R^2 = 1 - (\sigma_A^2 + \sigma_D^2)$
to the total phenotypic variance.
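
A minimal numerical check of this partitioning is sketched below. It assumes the additive and dominance components take the standard forms $2p(1-p)[a - d(2p-1)]^2$ and $4p^2(1-p)^2d^2$, and compares their sum with the variance of the centred effects $\mu(G)$ computed directly; the function names and the example values $p = 0.1$, $a = d = 0.2$ are illustrative choices.

```python
def genotype_effects(p, a, d):
    """Genotype frequencies and centred effects mu(G) for A1A1, A1A2, A2A2."""
    freqs = [p * p, 2 * p * (1 - p), (1 - p) ** 2]
    raw_mean = (2 * p - 1) * a + 2 * p * (1 - p) * d
    return freqs, [a - raw_mean, d - raw_mean, -a - raw_mean]


def variance_components(p, a, d):
    """Additive, dominance and residual variance for a standardized phenotype."""
    var_add = 2 * p * (1 - p) * (a - d * (2 * p - 1)) ** 2
    var_dom = 4 * p ** 2 * (1 - p) ** 2 * d ** 2
    return var_add, var_dom, 1.0 - (var_add + var_dom)


freqs, mu = genotype_effects(0.1, 0.2, 0.2)  # purely dominant minor allele
var_add, var_dom, var_res = variance_components(0.1, 0.2, 0.2)
direct = sum(f * m * m for f, m in zip(freqs, mu))
print(var_add, var_dom, var_res, direct)
```

For this dominant allele at frequency 0.1 the additive component is about 18 times the dominance component, in line with the remark that $\sigma_A^2$ may dominate even under purely dominant inheritance.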

The probability density of phenotypic values for sib pairs is denoted $f(X_1,X_2)$. It can be
expressed as a mixture of 9 conditional densities, one for each possible sib-pair genotype,
$$f(X_1,X_2) = \sum_{G_1,G_2} f(X_1,X_2 \mid G_1,G_2)\,P(G_1,G_2).$$

The mean of $X_i$ is $\mu(G_i)$ for sib $i = 1$ or 2; both $X_1$ and $X_2$ have residual
variance $\sigma_R^2$ and residual covariance (due to shared residual genetic and environmental
factors) $t_R$. The total phenotypic correlation $t$ for sib pairs is
$$t = t_R + \sigma_A^2/2 + \sigma_D^2/4$$
when effects from genotype $G$ are included.

Although $X_1$ and $X_2$ are natural coordinates for expressing sib phenotypic values, the
correlation between sibs complicates the joint distribution of $X_1$ and $X_2$. A simpler joint
distribution is obtained by noting that the sum and difference of $X_1$ and $X_2$ are completely
uncorrelated. These orthogonal coordinates representing sib mean and sib difference are
denoted $X_+$ and $X_-$, with
$$X_\pm = (X_1 \pm X_2)/2.$$

The probability distribution in these orthogonal coordinates, $f(X_+,X_- \mid G_1,G_2)$, factors
into the product of $f(X_+ \mid G_1,G_2)$ and $f(X_- \mid G_1,G_2)$, with
$$f(X_\pm \mid G_1,G_2) = (2\pi\sigma_\pm^2)^{-1/2} \exp\{-[X_\pm - \mu_\pm(G_1,G_2)]^2/2\sigma_\pm^2\},$$
using
$$\mu_\pm(G_1,G_2) = [\mu(G_1) \pm \mu(G_2)]/2$$
and
$$\sigma_\pm^2 = \sigma_R^2(1 \pm t_R)/2.$$

It is also convenient to define pair-mean and pair-difference allele frequencies
$p_\pm(G_1,G_2)$ as
$$p_\pm(G_1,G_2) = (p_{G_1} \pm p_{G_2})/2.$$
The variance of the pair-mean and pair-difference variables may be expressed more generally
for sib-ships of size $s$, with genotypic correlation $r$ between any two sibs within a sib-ship,
as
$$\mathrm{Var}(X_\pm) = \sigma_R^2\,T_\pm \quad\text{and}\quad \mathrm{Var}(p_\pm) = \sigma_p^2\,R_\pm,$$
where
$$T_\pm = [1 \pm (s-1)t_R]/s \quad\text{and}\quad R_\pm = [1 \pm (s-1)r]/s.$$

The family size $s$ is 2 for sib-pairs, and the genotypic correlation $r$ is 0.5 for full sibs.

In addition to the $X_1,X_2$ and $X_+,X_-$ coordinate systems, we also introduce a Mahalanobis
coordinate system. In this metric, a sib-pair is described by a radial coordinate $b$, which
expresses how extreme the pair of phenotypic values is, and an angle $\phi$, which determines
whether each sib has a positive or negative phenotypic value. The transformations relating the
Mahalanobis variables to the pair-mean and pair-difference variables are
$$X_+ = \sigma_+\,b\sin\phi \quad\text{and}\quad X_- = \sigma_-\,b\cos\phi.$$

The probability distribution in Mahalanobis coordinates is
$$f(b,\phi \mid G_1,G_2) = (2\pi)^{-1} \exp[-(b\sin\phi - \nu_+)^2/2]\,\exp[-(b\cos\phi - \nu_-)^2/2]$$
with
$$\nu_\pm = \mu_\pm/\sigma_\pm.$$
This distribution satisfies
$$\int_0^{2\pi} d\phi \int_0^{\infty} db\; b\, f(b,\phi \mid G_1,G_2) = 1.$$

In the absence of a contribution from the QTL, $f(b,\phi \mid G_1,G_2)$ reduces to
$(2\pi)^{-1}\exp(-b^2/2)$ and the Mahalanobis probability density is independent of the phase
$\phi$. Contour lines of equal probability density in the $X_1$-$X_2$ plane are ellipses tilted
at 45° with a ratio of major axis to minor axis of $[(1+t)/(1-t)]^{1/2}$.

4.2 Test statistic and pool design 

The tests of association described here depend on detecting differences in allele frequency in
DNA pooled from individuals chosen from a large DNA repository. The allele frequency in the
upper pool, with individuals selected to have higher phenotypic values, is denoted $p_U$; the
allele frequency in the lower pool, selected for lower phenotypic values, is $p_L$; and the test
statistic is $p_U - p_L$, denoted $\Delta p$.

The overall repository size is denoted $N$, composed entirely of either $N$ unrelated
individuals or $N/2$ sib pairs. The upper and lower pools each hold $n$ samples, and the pooling
fraction $\rho$ is defined as $n/N$.

For an unrelated population, only one design is described: selecting the $n$ individuals whose
phenotypic values are at the upper and lower tails of the distribution, thus defining upper and
lower thresholds $X_U$ and $X_L$. This is termed the unrelated-population design.

A corresponding design for sib pairs is termed unrelated-random. In this design, one sib is
chosen, at random, from each sib-ship to generate a population of $N/2$ unrelated individuals.
Individuals at the upper and lower tails of this unrelated subset are then selected for pooling.
The unrelated-random design for $N/2$ sib pairs with pooling fraction $\rho$ is essentially
equivalent to the unrelated-population design for $N/2$ individuals with pooling fraction
$2\rho$.

A second design selecting only unrelated individuals is termed the Mahalanobis design. The
pair-mean $X_+$ and pair-difference $X_-$ are used to define a Mahalanobis coordinate $b$
according to
$$b^2 = 2X_+^2/(1+t) + 2X_-^2/(1-t).$$

The $n$ sib-ships with the largest magnitude $b$ and a positive pair-mean $X_+$ are identified,
and the sibling with the larger phenotypic value is selected for the upper pool. Similarly, the
$n$ sib-ships with the largest $b$ and negative pair-mean are identified, and the sibling with
the more negative phenotypic value is selected for the lower pool.
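
The selection rule can be sketched in a few lines. The example below is illustrative: it assumes standardized phenotypes with sib correlation $t$, and the helper names and the toy list of sib pairs are hypothetical.

```python
import math


def mahalanobis_b(x1, x2, t):
    """Radial coordinate b from b^2 = 2 X+^2/(1+t) + 2 X-^2/(1-t)."""
    x_plus = (x1 + x2) / 2
    x_minus = (x1 - x2) / 2
    return math.sqrt(2 * x_plus ** 2 / (1 + t) + 2 * x_minus ** 2 / (1 - t))


def select_for_pools(pairs, t, n):
    """Phenotypes chosen for the upper and lower pools by the Mahalanobis rule."""
    def radial(pair):
        return mahalanobis_b(pair[0], pair[1], t)

    positive = [pair for pair in pairs if sum(pair) > 0]  # positive pair-mean
    negative = [pair for pair in pairs if sum(pair) < 0]  # negative pair-mean
    # One sib per selected sib-ship: the larger value for the upper pool,
    # the more negative value for the lower pool.
    upper = [max(pair) for pair in sorted(positive, key=radial, reverse=True)[:n]]
    lower = [min(pair) for pair in sorted(negative, key=radial, reverse=True)[:n]]
    return upper, lower


pairs = [(2.0, 1.5), (0.1, 0.2), (-1.8, -2.2), (-0.1, 0.05), (1.0, -0.5)]
print(select_for_pools(pairs, t=0.4, n=1))
```

With this definition $b^2$ coincides with the usual Mahalanobis distance $(X_1^2 - 2tX_1X_2 + X_2^2)/(1-t^2)$ for a bivariate normal with unit variances and correlation $t$.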

Two remaining designs select both members of a sib pair for pooling. The pair-mean design
selects each sib-ship as a family unit based on the phenotypic mean of the pair. The $n/2$ pairs
at the extreme upper and lower tails of the distribution of phenotypic means for sib-ships,
comprising $n$ individuals each, are selected for the upper and lower pools respectively. The
upper and lower thresholds are again termed $X_U$ and $X_L$.

The pair-difference design selects individuals based on the difference of phenotypic values
within each sib-ship, or equivalently on the magnitude of within-family phenotypic variance.
The $n$ sib-pairs with the greatest within-family variance are identified. Within each pair, the
individual with the higher phenotypic value is selected for the upper pool, and the individual
with the lower phenotypic value is selected for the lower pool. The threshold for the
magnitude of the difference $|X_1 - X_2|/2$ for selecting families is termed $X_T$.

Since $X_+$ and $X_-$ are uncorrelated within each family, the results of the pair-mean and
pair-difference designs are statistically independent and may be combined to yield a single,
potentially more powerful test.

4.3 Test power 

Under the null hypothesis $H_0$, the expectation for $p_U$ and $p_L$ is the population mean
allele frequency, and the expectation for the test statistic $\Delta p$ is zero. Under the
alternative hypothesis $H_1$, the expectation $E_1(\Delta p)$ for $\Delta p$ is non-zero. The
power of a test of $\Delta p$ depends on the magnitude of $E_1(\Delta p)$ compared to the
variation of $\Delta p$ under $H_0$ and $H_1$, and in turn on the variation of $p_U$ and $p_L$.

Both $p_U$ and $p_L$ follow multinomial distributions defined by the probability that an
individual with zero, one, or two copies of allele $A_1$ is selected for pooling. When the total
number of individuals selected for each pool is large and the number of copies of allele $A_1$
in each pool is also large, the multinomial distribution giving $\Delta p$ is described
accurately by a normal distribution. The variance of $\Delta p$ under $H_0$ is denoted
$\sigma_0^2/n$ and the variance under $H_1$ is denoted $\sigma_1^2/n$, where $\sigma_0^2$ and
$\sigma_1^2$ depend on the model parameters and the pooling design. The number of individuals
required for type I error rate $\alpha$ and type II error rate $\beta$ is
$$n = (z_\alpha\sigma_0 - z_{1-\beta}\sigma_1)^2 / E_1(\Delta p)^2.$$

The terms $z_\alpha$ and $z_{1-\beta}$ are normal deviates corresponding to the error rates,
$$\Phi(z_\alpha) = 1 - \alpha \quad\text{and}\quad \Phi(z_{1-\beta}) = \beta,$$
where $\Phi(z)$ is the cumulative probability function for the standard normal distribution,
$$\Phi(z) = \int_{-\infty}^{z} dz'\,(2\pi)^{-1/2}\exp(-z'^2/2).$$

The significance level $\alpha$ is for a one-sided test, which is appropriate for association
tests for disease-susceptibility markers. If markers for protective polymorphisms are also
sought, the significance for a two-sided test is more appropriate.

The method used here to optimize test designs is to specify the error rates $\alpha$ and
$\beta$, then calculate the selection criteria that minimize the total repository size $N$
required to achieve these error rates for specific genetic models. The method is outlined
below, along with a summary of analytical approximations for the repository sizes required for
different population structures and pooling designs. Comparisons of the analytical
approximations with essentially exact numerical calculations are found in the Results section,
and mathematical details are provided in the Appendix.

To optimize $N$, a trial value of the fraction $\rho$ is chosen. Next, the threshold phenotypic
values that select $n = \rho N$ individuals for each pool are derived from the distribution of
phenotypic values. Depending on the pooling design, these threshold values may refer to
phenotypes for unrelated individuals, the Mahalanobis measure $b$, the pair-mean measure $X_+$,
or the pair-difference measure $X_-$. The threshold values are used to calculate the
probabilities $\theta_U(G)$ and $\theta_L(G)$ that an individual selected for the upper and lower
pools has a particular genotype $G$. These probabilities provide the expectation
$E_1(\Delta p)$ of $\Delta p$ under $H_1$, as well as the variances $\sigma_0^2/n$ and
$\sigma_1^2/n$ of $\Delta p$ under $H_0$ and $H_1$. Values of $\Delta p$ larger than
$z_\alpha\sigma_0/n^{1/2}$ are significant at level $\alpha$, and the corresponding power of the
test is
$$1 - \beta = \Phi\{[(\rho N)^{1/2}E_1(\Delta p) - z_\alpha\sigma_0]/\sigma_1\}.$$

Since the terms $E_1(\Delta p)$, $\sigma_0^2$, and $\sigma_1^2$ depend on $\rho$ but not on $N$
or $n$, this equation may be inverted to find $N$ as a function of $1-\beta$,
$$N = (z_\alpha\sigma_0 - z_{1-\beta}\sigma_1)^2 / \rho E_1(\Delta p)^2.$$

Optimization proceeds by a search for the value of $\rho$ giving smallest $N$.
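
The relation between power and repository size can be illustrated with a round trip between the two expressions. In the sketch below the function names and the numerical inputs ($\rho = 0.27$, $E_1(\Delta p) = 0.05$, $\sigma_0 = 0.60$, $\sigma_1 = 0.59$) are hypothetical values chosen only to exercise the formulas.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def required_repository(rho, e1, sigma0, sigma1, alpha, beta):
    """N = (z_alpha*sigma_0 - z_{1-beta}*sigma_1)^2 / (rho * E1(dp)^2)."""
    z_alpha = STD.inv_cdf(1 - alpha)
    z_beta = STD.inv_cdf(beta)  # z_{1-beta}, defined by Phi(z_{1-beta}) = beta
    return (z_alpha * sigma0 - z_beta * sigma1) ** 2 / (rho * e1 ** 2)


def power(n_total, rho, e1, sigma0, sigma1, alpha):
    """1 - beta = Phi{[(rho*N)^(1/2) * E1(dp) - z_alpha*sigma_0] / sigma_1}."""
    z_alpha = STD.inv_cdf(1 - alpha)
    return STD.cdf((math.sqrt(rho * n_total) * e1 - z_alpha * sigma0) / sigma1)


n_total = required_repository(0.27, 0.05, 0.60, 0.59, alpha=5e-8, beta=0.2)
print(n_total, power(n_total, 0.27, 0.05, 0.60, 0.59, alpha=5e-8))
```

Evaluating the power at the returned $N$ recovers the requested $1 - \beta$, which confirms that the two expressions are inverses of one another.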

For complex traits, the total variance $\sigma_A^2 + \sigma_D^2$ from any QTL is small, and
$\sigma_R^2$ is close to 1. This suggests that the displacements $a$, $d$, and $-a$ are also
small relative to $\sigma_R$ and motivates a perturbation expansion of $E_1(\Delta p)$ and
$\sigma_1^2$ in terms of $\mu(G)/\sigma_R$. The expression for $\Delta p$ is linear in the
expansion parameter, while $\sigma_1^2$ is identical to $\sigma_0^2$ to first order. Collecting
the lowest order terms, the result for the required repository size $N$ is proportional to
$\sigma_R^2/\sigma_A^2$. The constant of proportionality depends on the pooling fraction $\rho$,
the phenotypic correlation between sibs, the specified values of $\alpha$ and $\beta$, and the
pooling design, but not on any properties of the genetic model other than $\sigma_A^2$.

In deriving the optimal test designs and estimating the test power, we assume implicitly that
there is no measurement error in either the allele frequency $p$ or the allele frequency
difference $\Delta p$. For the allele frequency $p$, we show in the Results that either using
the mean value $(p_U + p_L)/2$ or measuring the allele frequency on a large pool of individuals
unselected for phenotypic value should provide an adequate estimate for $p$. We also discuss the
reduction in power due to measurement error in $\Delta p$.

Unrelated design

When a repository contains $N$ unrelated individuals, the analytical approximation for the
required repository size, derived in the Appendix, is
$$N_{\text{unrelated}} = (\rho/2y_\rho^2)\,(z_\alpha - z_{1-\beta})^2\,\sigma_R^2/\sigma_A^2,$$
where $y_\rho$ is the height of the standard normal density at $\Phi^{-1}(\rho)$.

This function is a minimum at $\rho = 0.27$, with $\rho/2y_\rho^2 = 1.24$.
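
The location of this minimum can be checked numerically. The sketch below minimizes the factor $\rho/2y_\rho^2$ with a golden-section search, used here as a simple stand-in for any one-dimensional optimizer; the function names are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def unrelated_factor(rho):
    """Geometric factor rho / (2 * y_rho^2), with y_rho the normal density at Phi^-1(rho)."""
    y_rho = STD.pdf(STD.inv_cdf(rho))
    return rho / (2 * y_rho ** 2)


def golden_section_min(func, lo, hi, tol=1e-7):
    """Minimizer of a unimodal function on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    while hi - lo > tol:
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
        if func(c) < func(d):
            hi = d
        else:
            lo = c
    return 0.5 * (lo + hi)


rho_opt = golden_section_min(unrelated_factor, 0.01, 0.49)
print(rho_opt, unrelated_factor(rho_opt))
```

The search settles near a pooling fraction of 0.27 with a factor close to 1.24; doubling that factor gives the value of about 2.47 that applies when $s = 2$.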

If the population consists of sib pairs rather than unrelated individuals, an unrelated
sub-population of $N/2$ individuals may be constructed by selecting one sib at random from each
pair. A direct extension of the above result for unrelated populations yields
$$N_{\text{random-sib}} = 2\,[(2\rho)/2y_{2\rho}^2]\,(z_\alpha - z_{1-\beta})^2\,\sigma_R^2/\sigma_A^2$$
for the sib-pair population. The repository size required for sib pairs is twice as large as for
unrelated individuals, with a pooling fraction half as large.

Mahalanobis design

The analytical approximation for the number of individuals required for the Mahalanobis
design, derived in the Appendix, is
$$N_{\text{Mahal}} = (2\rho)^{-1}\,[(2b_\rho/\pi) + \Phi(-b_\rho)/\rho(2\pi)^{1/2}]^{-2}\,[R_+/T_+^{1/2} + R_-/T_-^{1/2}]^{-2}\,(z_\alpha - z_{1-\beta})^2\,\sigma_R^2/\sigma_A^2.$$
The initial geometrical factor depends only on the pooling fraction. It is minimized at
$\rho = 0.188$ with a value of 2.90, yielding
$$N_{\text{Mahal}} = 2.90\,[R_+/T_+^{1/2} + R_-/T_-^{1/2}]^{-2}\,(z_\alpha - z_{1-\beta})^2\,\sigma_R^2/\sigma_A^2$$
for this pooling design.
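
The geometrical factor can be evaluated on a grid to locate its minimum. The sketch below assumes the lowest-order threshold relation $\rho = (1/4)\exp(-b_\rho^2/2)$ given in the Appendix, so that $b_\rho = [-2\ln(4\rho)]^{1/2}$; the function name and the grid spacing are illustrative choices.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def mahalanobis_factor(rho):
    """(2 rho)^-1 * [2 b/pi + Phi(-b) / (rho * sqrt(2 pi))]^-2 with rho = exp(-b^2/2) / 4."""
    b = math.sqrt(-2.0 * math.log(4.0 * rho))
    bracket = 2.0 * b / math.pi + STD.cdf(-b) / (rho * math.sqrt(2.0 * math.pi))
    return 1.0 / (2.0 * rho * bracket ** 2)


# A plain grid search over 0.01 <= rho < 0.25 (the threshold b is real only for rho <= 0.25).
grid = [i / 10000 for i in range(100, 2500)]
rho_opt = min(grid, key=mahalanobis_factor)
print(rho_opt, mahalanobis_factor(rho_opt))
```

The grid minimum falls close to $\rho = 0.188$ with a factor of about 2.90, matching the values quoted for this design.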

Pair-mean design

The analytical approximation for the repository size required by the pair-mean design is
$$N_{\text{pair-mean}} = (s\rho/2y_\rho^2)\,(T_+/R_+)\,(z_\alpha - z_{1-\beta})^2\,\sigma_R^2/\sigma_A^2,$$
where $s = 2$ for sib pairs. As with the unrelated design, the factor $\rho/y_\rho^2$ is
optimized with a pooling fraction of 0.27, yielding
$$N_{\text{pair-mean}} = 2.47\,(T_+/R_+)\,(z_\alpha - z_{1-\beta})^2\,\sigma_R^2/\sigma_A^2$$
for the required repository size.

Pair-difference design

An analytical approximation for the repository size required by the pair-difference design is
$$N_{\text{pair-diff}} = (s\rho/2y_\rho^2)\,(T_-/R_-)\,(z_\alpha - z_{1-\beta})^2\,\sigma_R^2/\sigma_A^2.$$
The factor $\rho/y_\rho^2$ is minimized with a pooling fraction of 0.27, and
$$N_{\text{pair-diff}} = 2.47\,(T_-/R_-)\,(z_\alpha - z_{1-\beta})^2\,\sigma_R^2/\sigma_A^2$$
is the required repository size.

Combined pair-mean and pair-difference design

Because the pair-mean variables $X_+$ and $p_+$ are uncorrelated with the pair-difference
variables $X_-$ and $p_-$, the pair-mean and pair-difference estimators are independent and may
be combined into a single test. The combined test uses the measured value of $\Delta p_\pm$,
where the + and - signs refer to the allele frequency differences found for the pair-mean and
pair-difference pools, to obtain an estimator for $\sigma_A/\sigma_R$. The pair-mean and
pair-difference estimators, $Q_\pm$, each with expectation $\sigma_A/\sigma_R$, are
$$Q_\pm = (T_\pm^{1/2}/R_\pm)\,(\rho/2y_\rho\sigma_p)\,\Delta p_\pm,$$
with
$$\mathrm{Var}(Q_\pm) = (s\rho/2y_\rho^2N)\,T_\pm/R_\pm$$
from the expressions provided in the Appendix for $\mathrm{Var}(\Delta p_\pm)$. These
expressions differ by the factor $sR_\pm$ from a similar expression provided by Ollivier et al.
(1997) which incorrectly neglected the contribution of family structure to
$\mathrm{Var}(\Delta p_\pm)$.

The combined minimum-variance estimator $Q$ having expectation $\sigma_A/\sigma_R$, constructed
by weighting the pair-mean and pair-difference estimators according to their inverse variances,
is
$$Q = (\rho/2y_\rho\sigma_p)\,[(R_+/T_+) + (R_-/T_-)]^{-1}\,(T_+^{-1/2}\Delta p_+ + T_-^{-1/2}\Delta p_-),$$
with
$$\mathrm{Var}(Q) = (s\rho/2y_\rho^2N)\,[(R_+/T_+) + (R_-/T_-)]^{-1}.$$
An analytical approximation for the repository size required using the combined estimator is
$$N_{\text{comb}} = (s\rho/2y_\rho^2)\,[(R_+/T_+) + (R_-/T_-)]^{-1}\,(z_\alpha - z_{1-\beta})^2\,\sigma_R^2/\sigma_A^2.$$

At the optimal pooling fraction of $\rho = 0.27$, the factor $(s\rho/2y_\rho^2)$ is 2.47. Since
the variances of the individual estimators are identical under $H_0$ and $H_1$, the repository
size for the combined estimator is simply the reciprocal of the sum of the reciprocal repository
sizes required for the individual estimators.
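
The reciprocal-sum rule can be verified with a small calculation. The sketch below assumes sib pairs ($s = 2$, $r = 0.5$) and expresses every size in units of $(z_\alpha - z_{1-\beta})^2\sigma_R^2/\sigma_A^2$; the function names and the trial values of the residual correlation $t_R$ are hypothetical.

```python
def family_factors(t_r, r=0.5, s=2):
    """T+, T-, R+, R- for sib-ships of size s."""
    t_plus = (1 + (s - 1) * t_r) / s
    t_minus = (1 - (s - 1) * t_r) / s
    r_plus = (1 + (s - 1) * r) / s
    r_minus = (1 - (s - 1) * r) / s
    return t_plus, t_minus, r_plus, r_minus


def pooled_sizes(t_r, factor=2.47):
    """Pair-mean, pair-difference and combined sizes; `factor` is s*rho/(2*y_rho^2)."""
    t_plus, t_minus, r_plus, r_minus = family_factors(t_r)
    n_mean = factor * t_plus / r_plus
    n_diff = factor * t_minus / r_minus
    n_comb = factor / (r_plus / t_plus + r_minus / t_minus)
    return n_mean, n_diff, n_comb


for t_r in (0.0, 0.2, 0.4):
    n_mean, n_diff, n_comb = pooled_sizes(t_r)
    print(t_r, n_mean, n_diff, n_comb, 1 / (1 / n_mean + 1 / n_diff))
```

For $t_R = 0$ the combined size is half the factor 2.47, about 1.24 in the same units, which is the optimal factor quoted for unrelated individuals.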

4.4 Regression tests 

Regression tests requiring individual genotyping provide a benchmark for the efficiency of
tests on pooled DNA. A regression test assesses the significance of the regression coefficient
$m$ in the model
$$X_i = m(p_i - p) + \epsilon_i,$$
where $i$ labels an observation, $X_i$ is an observed phenotype with mean 0 and variance 1,
$p_i$ is the corresponding observed genotype with mean $p$, and $\epsilon_i$ is the residual
contribution not explained by the model. For $N$ unrelated individuals, the phenotypic and
genotypic variables in the regression test are the individual $X_i$ and $p_i$ values. For $N/2$
sib-pairs, they are the pair-mean and pair-difference variables $X_\pm$ and $p_\pm$ for each
pair.

The expectation of the regression coefficient $m$ is 0 under $H_0$ and is
$$E(m) = \sigma_A/\sigma_p$$
under $H_1$. The variance of the estimator, assumed identical under both hypotheses with
negligible error when $\sigma_R^2$ is close to 1, is
$$\mathrm{Var}(m) = (s/N)\,\mathrm{Var}(X)/\mathrm{Var}(p) = (s/N)(T/R)\,\sigma_R^2/\sigma_p^2,$$
where $s = 1$ for unrelated individuals or 2 for sib-pairs, and $T/R = 1$ for unrelated
individuals and $T_\pm/R_\pm$ for pair-mean and pair-difference variables.

The expectation and variance of the test statistic are related to the false-positive rate and
power through the equation
$$[\mathrm{Var}(m)]^{-1} = (z_\alpha - z_{1-\beta})^2/[E(m)]^2.$$

Substitution into this equation yields the repository size requirement for the regression test,
$$N_{\text{regr}} = s\,(T/R)\,(z_\alpha - z_{1-\beta})^2\,\sigma_R^2/\sigma_A^2.$$

The combined estimator formed from the pair-mean and pair-difference estimators has a
repository size requirement of
$$N_{\text{regr}} = s\,[(R_+/T_+) + (R_-/T_-)]^{-1}\,(z_\alpha - z_{1-\beta})^2\,\sigma_R^2/\sigma_A^2.$$
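
The regression benchmark can be sketched in the same units. The example below follows from substituting $\mathrm{Var}(m)$ and $E(m)$ into the power relation, under the assumption that the two sib-pair regressions are combined by inverse-variance weighting; `scale` stands for $(z_\alpha - z_{1-\beta})^2\sigma_R^2/\sigma_A^2$, and all names are hypothetical.

```python
def regression_size(scale, s=1, t=1.0, r=1.0):
    """N = s * (T/R) * scale, obtained from 1/Var(m) = (z_alpha - z_{1-beta})^2 / E(m)^2."""
    return s * (t / r) * scale


def combined_regression_size(scale, t_r, r=0.5, s=2):
    """Inverse-variance combination of the pair-mean and pair-difference regressions."""
    t_plus, t_minus = (1 + (s - 1) * t_r) / s, (1 - (s - 1) * t_r) / s
    r_plus, r_minus = (1 + (s - 1) * r) / s, (1 - (s - 1) * r) / s
    return s * scale / (r_plus / t_plus + r_minus / t_minus)


scale = 100.0  # arbitrary illustrative value
print(regression_size(scale))                        # unrelated individuals
print(regression_size(scale, s=2, t=0.5, r=0.75))    # pair-mean regression with t_R = 0
print(combined_regression_size(scale, t_r=0.0))      # combined sib-pair regression
```

With no residual sib correlation the combined sib-pair regression needs the same number of individuals as a regression on unrelated individuals, so in that limit the optimal pooled designs require about 1.24 times the individually genotyped sample.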

4.5 Computational methods 

Results for required repository sizes were obtained numerically using computations converged
to $1\times10^{-6}$ (Press, WH, Teukolsky, SA, Vetterling, WT, and Flannery, BP: Numerical
recipes in C, the art of scientific computing, ed 2. Cambridge, UK, Cambridge University
Press, 1997).

Brent's root-finding algorithm was used to determine the threshold values $X_U$ and $X_L$ for
the upper and lower pools for a given pooling design and pooling fraction $\rho$; Brent's
optimization algorithm was then used to find the $\rho$ with maximum power. While the reported
results are based on a normal approximation for the allele frequency difference $\Delta p$,
results were also obtained using the underlying multinomial distribution for the
unrelated-population design. The difference between the numerical results for the multinomial
and normal distributions was typically less than 1%. The repository size required for the
pooling combined estimator was obtained numerically as the reciprocal of the sum of the
reciprocal exact sizes required for the pair-mean and pair-difference pooling designs. Using a
750 MHz Pentium III running Linux, the root-finding and minimization for each parameter set
required less than 0.01 sec for each design.

To assess the error made by assuming a normal distribution for $\Delta p$, we also performed
tests in which $\Delta p$ was calculated exactly according to a multinomial distribution.
Results for the required repository size based on the normal distribution were then compared to
the repository size based on a multinomial distribution. The two results for $N$ differed by no
more than 5% when the number of copies of the minor allele summed over both pools was greater
than 60. They differed by approximately 10% when the number of alleles was 10, with the normal
distribution underestimating the exact repository size. These differences are not visible on the
scale of the figures.

Appendix 4A: Mathematical details

4A.1 Unrelated design

The unrelated design considers a population of $N$ unrelated individuals. Upper and lower
thresholds $X_U$ and $X_L$ are defined using
$$\rho = \sum_G \Phi\{-[X_U - \mu(G)]/\sigma_R\}\,P(G) \quad\text{and}$$
$$\rho = \sum_G \Phi\{[X_L - \mu(G)]/\sigma_R\}\,P(G),$$
which may be inverted numerically to find $X_U$ and $X_L$ as functions of $\rho$. The
probability that an individual selected for a pool has genotype $G$ is denoted $\theta_U(G)$ for
the upper pool and $\theta_L(G)$ for the lower pool,
$$\theta_U(G) = \rho^{-1}\,\Phi\{-[X_U - \mu(G)]/\sigma_R\}\,P(G) \quad\text{and}$$
$$\theta_L(G) = \rho^{-1}\,\Phi\{[X_L - \mu(G)]/\sigma_R\}\,P(G).$$

The expected allele frequencies under $H_1$ are
$$E_1(p_U) = \sum_G \theta_U(G)\,p_G \quad\text{and}\quad E_1(p_L) = \sum_G \theta_L(G)\,p_G,$$
with
$$E_1(\Delta p) = E_1(p_U) - E_1(p_L).$$

The variance of the test statistic can be obtained from the moments of a multinomial
distribution (Beyer WH (ed): CRC Standard Mathematical Tables, ed 27. Boca Raton, CRC Press,
1984),
$$\sigma_0^2 = 2\sum_G P(G)\,p_G^2 - 2p^2 = 2\sigma_p^2 \quad\text{and}$$
$$\sigma_1^2 = \sum_G [\theta_U(G) + \theta_L(G)]\,p_G^2 - (p_U^2 + p_L^2).$$

Thus, when $\rho$ is specified, the terms in the expression for the repository size $N$,
$(z_\alpha\sigma_0 - z_{1-\beta}\sigma_1)^2/\rho E_1(\Delta p)^2$, may all be calculated
numerically, and the optimal $\rho$ is obtained by numerical minimization of $N$ as a function
of $\rho$.
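
The numerical procedure for the unrelated design can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used for the reported results: bisection and a coarse grid stand in for Brent's root-finding and optimization algorithms, the function names are hypothetical, and the example parameters ($p = 0.5$, $a = 0.2$, $d = 0$, giving $\sigma_A^2 = 0.02$) are chosen only for illustration.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def genotype_model(p, a, d):
    """P(G), p_G and centred effects mu(G) for genotypes A1A1, A1A2, A2A2."""
    freqs = [p * p, 2 * p * (1 - p), (1 - p) ** 2]
    allele = [1.0, 0.5, 0.0]
    raw = [a, d, -a]
    mean = sum(f * m for f, m in zip(freqs, raw))
    return freqs, allele, [m - mean for m in raw]


def bisect(func, lo, hi, tol=1e-10):
    """Root of a monotone function on [lo, hi]; a simple stand-in for Brent's method."""
    f_lo = func(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def repository_size(rho, p, a, d, alpha=5e-8, beta=0.2):
    """N = (z_alpha*sigma_0 - z_{1-beta}*sigma_1)^2 / (rho * E1(dp)^2), unrelated design."""
    freqs, allele, mu = genotype_model(p, a, d)
    sigma_r = (1.0 - sum(f * m * m for f, m in zip(freqs, mu))) ** 0.5

    def upper_tail(x):  # P(X > x | G) for each genotype
        return [STD.cdf(-(x - m) / sigma_r) for m in mu]

    def lower_tail(x):  # P(X < x | G) for each genotype
        return [STD.cdf((x - m) / sigma_r) for m in mu]

    # Thresholds X_U and X_L that place a fraction rho of the population in each pool.
    x_u = bisect(lambda x: sum(f * t for f, t in zip(freqs, upper_tail(x))) - rho, -10.0, 10.0)
    x_l = bisect(lambda x: sum(f * t for f, t in zip(freqs, lower_tail(x))) - rho, -10.0, 10.0)
    theta_u = [f * t / rho for f, t in zip(freqs, upper_tail(x_u))]
    theta_l = [f * t / rho for f, t in zip(freqs, lower_tail(x_l))]

    e_u = sum(t * g for t, g in zip(theta_u, allele))
    e_l = sum(t * g for t, g in zip(theta_l, allele))
    var0 = 2 * (sum(f * g * g for f, g in zip(freqs, allele)) - p * p)
    var1 = sum((tu + tl) * g * g
               for tu, tl, g in zip(theta_u, theta_l, allele)) - (e_u ** 2 + e_l ** 2)

    z_alpha = STD.inv_cdf(1 - alpha)
    z_beta = STD.inv_cdf(beta)  # z_{1-beta}, defined by Phi(z_{1-beta}) = beta
    return (z_alpha * var0 ** 0.5 - z_beta * var1 ** 0.5) ** 2 / (rho * (e_u - e_l) ** 2)


def optimal_fraction(p, a, d, step=0.01):
    """Pooling fraction on a coarse grid (0.05 to 0.45) that minimizes N."""
    grid = [0.05 + step * i for i in range(41)]
    return min(grid, key=lambda rho: repository_size(rho, p, a, d))


print(repository_size(0.27, 0.5, 0.2, 0.0), optimal_fraction(0.5, 0.2, 0.0))
```

For a small additive effect such as this one, the exact size at $\rho = 0.27$ should lie within a few percent of the lowest-order analytical estimate $(\rho/2y_\rho^2)(z_\alpha - z_{1-\beta})^2\sigma_R^2/\sigma_A^2$.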

An approximate analytical expression for $N$ may be obtained when $\sigma_R^2$ is close to 1 by
noting that
$$\Phi(z - \delta) = \Phi(z) - y\delta,$$
where $y = (2\pi)^{-1/2}\exp\{-z^2/2\}$, is correct to lowest order in the small parameter
$\delta$. Using $\mu(G)/\sigma_R$ as the small parameter $\delta$, the phenotypic thresholds are
$$X_U = -X_L = -\sigma_R\,\Phi^{-1}(\rho),$$
and the expected difference in allele frequency is
$$E(\Delta p) = 2y_\rho\Big[\sum_G P(G)\,\mu(G)\,p_G\Big]/\rho\sigma_R = 2y_\rho\sigma_p\sigma_A/\rho\sigma_R,$$
where $y_\rho = (2\pi)^{-1/2}\exp\{-[\Phi^{-1}(\rho)]^2/2\}$. To the same order of approximation
in $\mu(G)/\sigma_R$, both $\sigma_0^2$ and $\sigma_1^2$ may be replaced with $2\sigma_p^2$. The
resulting approximation for the required repository size is
$$N_{\text{unrelated}} = (\rho/2y_\rho^2)\,(z_\alpha - z_{1-\beta})^2\,\sigma_R^2/\sigma_A^2.$$

The minimum occurs at $\rho = 0.27$ and $y_\rho = 0.33$.

4A.2 Mahalanobis design

For the Mahalanobis design, thresholds $b_U$ and $b_L$ for the radial coordinate are
established for the upper and lower pool by solving the following normalization equations:
$$\rho = (1/2)\sum_{G_1,G_2} P(G_1,G_2) \int_0^{\pi} d\phi \int_{b_U}^{\infty} db\; b\, f(b,\phi \mid G_1,G_2) \quad\text{and}$$
$$\rho = (1/2)\sum_{G_1,G_2} P(G_1,G_2) \int_{\pi}^{2\pi} d\phi \int_{b_L}^{\infty} db\; b\, f(b,\phi \mid G_1,G_2).$$

The factor of (1/2) arises because only one individual is selected from each sib pair. If the
radial coordinate $b$ is larger than the threshold value, the phase angle $\phi$ determines which
sib is selected for which pool: the sibling with genotype $G_1$ is selected for the upper pool if
$0 < \phi < \pi/2$ and for the lower pool if $\pi < \phi < 3\pi/2$; the sibling with genotype
$G_2$ is selected for the upper pool if $\pi/2 < \phi < \pi$ and for the lower pool if
$3\pi/2 < \phi < 2\pi$. The genotype probabilities $\theta_U(G)$ and $\theta_L(G)$ for the upper
and lower pools may be written
$$\theta_U(G) = \rho^{-1}\sum_{G'} P(G,G') \int_0^{\pi/2} d\phi \int_{b_U}^{\infty} db\; b\, f(b,\phi \mid G,G') \quad\text{and}$$
$$\theta_L(G) = \rho^{-1}\sum_{G'} P(G,G') \int_{\pi}^{3\pi/2} d\phi \int_{b_L}^{\infty} db\; b\, f(b,\phi \mid G,G'),$$
where symmetry between siblings has allowed the change in integration limits for $\phi$ to
consider only the regions where sibling 1 is selected. Once $\rho$ is specified, the thresholds
for $b$ may be obtained numerically, and $E_1(\Delta p)$ may be obtained from $\theta_U$ and
$\theta_L$. Numerical results for the required repository size may then be obtained as outlined
above for the unrelated design.

An analytic approximation for the repository size requirement may be obtained by noting that
$$f(b,\phi \mid G_1,G_2) = (2\pi)^{-1}\,[1 + (b\nu_+)\sin\phi + (b\nu_-)\cos\phi]\,\exp(-b^2/2)$$
to lowest order in the gene effect $\mu(G)$. The normalization condition leads to the equation
$$\rho = (1/4)\exp(-b_\rho^2/2),$$
with $b_U = b_L = b_\rho$ defined in terms of the pooling fraction $\rho$. The genotype
frequencies in the upper and lower pools are
$$\theta_{U,L}(G) = P(G) \pm \sum_{G'} P(G,G')\,(\nu_+ + \nu_-)\,[(2b_\rho/\pi) + \Phi(-b_\rho)/\rho(2\pi)^{1/2}],$$
where the upper pool has the + sign and the lower pool the - sign. The expected allele
frequencies in the upper and lower pools are
$$E(p_{U,L}) = p \pm [(2b_\rho/\pi) + \Phi(-b_\rho)/\rho(2\pi)^{1/2}]\,[R_+/T_+^{1/2} + R_-/T_-^{1/2}]\,\sigma_p\sigma_A/\sigma_R,$$
where the upper pool has the positive deviation from $p$ and the lower pool the negative
deviation. These results are derived using the identities
$$\sum_{G_1,G_2} P(G_1,G_2)\,\mu(G_1)\,p_{G_1} = \sigma_p\sigma_A \quad\text{and}\quad \sum_{G_1,G_2} P(G_1,G_2)\,\mu(G_1)\,p_{G_2} = r\,\sigma_p\sigma_A,$$
where $r$ is the genotypic correlation (0.5 for full-sibs). Since $\theta_U(G) + \theta_L(G)$ is
$2P(G)$, the variance term $\sigma_1^2$ is equal to $\sigma_0^2$, and both are equal to
$2\sigma_p^2$ because all the pooled individuals are unrelated. The approximate expression for
the number of individuals required for the Mahalanobis design is
$$N_{\text{Mahalanobis}} = (2\rho)^{-1}\,[(2b_\rho/\pi) + \Phi(-b_\rho)/\rho(2\pi)^{1/2}]^{-2}\,[R_+/T_+^{1/2} + R_-/T_-^{1/2}]^{-2}\,(z_\alpha - z_{1-\beta})^2\,\sigma_R^2/\sigma_A^2.$$

The minimum occurs at $\rho = 0.188$.

4A.3 Pair-mean design

The fraction $\rho$ of the total population selected according to pair-mean pooling is defined
in terms of the upper threshold $X_U$ and the lower threshold $X_L$ as
$$\rho = \sum_{G_1,G_2} P(G_1,G_2)\,\Phi\{-[X_U - \mu_+(G_1,G_2)]/\sigma_+\} = \sum_{G_1,G_2} P(G_1,G_2)\,\Phi\{[X_L - \mu_+(G_1,G_2)]/\sigma_+\}.$$

The genotype distribution describing the individuals selected for each pool follows a
multinomial distribution based on sib-pair genotypes rather than individual genotypes, such
that
$$\sum_{G_1,G_2} \theta_U(G_1,G_2) = \sum_{G_1,G_2} \theta_L(G_1,G_2) = 1,$$
with
$$\theta_U(G_1,G_2) = \rho^{-1}\,\Phi\{-[X_U - \mu_+(G_1,G_2)]/\sigma_+\}\,P(G_1,G_2) \quad\text{and}$$
$$\theta_L(G_1,G_2) = \rho^{-1}\,\Phi\{[X_L - \mu_+(G_1,G_2)]/\sigma_+\}\,P(G_1,G_2).$$
The expected allele frequencies under $H_1$ are
$$E_1(p_U) = \sum_{G_1,G_2} \theta_U(G_1,G_2)\,p_+(G_1,G_2) \quad\text{and}\quad E_1(p_L) = \sum_{G_1,G_2} \theta_L(G_1,G_2)\,p_+(G_1,G_2),$$
where $p_+(G_1,G_2)$ is the pair-mean allele frequency as defined previously. The terms giving
the variance of the test statistic under $H_0$ and $H_1$ are
$$\sigma_0^2 = 2s\Big\{\sum_{G_1,G_2} P(G_1,G_2)\,[p_+(G_1,G_2)]^2\Big\} - 2sp^2 = 2sR_+\sigma_p^2 = 3\sigma_p^2 \quad\text{and}$$
$$\sigma_1^2 = s\Big\{\sum_{G_1,G_2} [\theta_U(G_1,G_2) + \theta_L(G_1,G_2)]\,[p_+(G_1,G_2)]^2\Big\} - s\,(p_U^2 + p_L^2).$$

The factor $s = 2$ accounts for the family structure, as $n/s$ rather than $n$ measurements of
$p_+$ are used to determine the allele frequency of each pool. The variance under the null
hypothesis may be derived directly from the sib-pair genotype frequencies, or more simply by
noting that the variance of the mean allele frequency for a sib-pair is $R_+\sigma_p^2$, which
is (3/4) of the variance $\sigma_p^2$ for an individual. Taking the mean of $n/2$ such terms
reduces the variance for each pool by $n/2$. The total variance is obtained by multiplying by 2
for the number of pools, yielding $3\sigma_p^2$. Given $\rho$, the pooling thresholds are
obtained numerically, then used to calculate $E_1(\Delta p)$ and $\sigma_1^2$, yielding a
numerical result for the repository size $N$ as a function of $\rho$.

An analytical approximation follows the same derivation used for the unrelated design, except
that individual genotypes are replaced by sib-pair genotypes, and individual phenotypes,
phenotype offsets, and allele frequencies are replaced by their pair-mean analogs. The upper
and lower pooling thresholds are
$$X_U = -X_L = -\sigma_+\,\Phi^{-1}(\rho),$$
and the allele frequency difference between pools is
$$E(\Delta p) = 2y_\rho\Big[\sum_{G_1,G_2} P(G_1,G_2)\,\mu_+(G_1,G_2)\,p_+(G_1,G_2)\Big]/\rho\sigma_+ = (2y_\rho/\rho)\,(R_+/T_+^{1/2})\,\sigma_p\sigma_A/\sigma_R,$$
where $y_\rho$ is the height of the standard normal density at $\Phi^{-1}(\rho)$ as before. The
contributions of the corresponding low-order terms in $\sigma_1^2$ cancel, and the variance of
$\Delta p$ is the same under both hypotheses. The repository size required by the pair-mean
design is
$$N_{\text{pair-mean}} = (s\rho/2y_\rho^2)\,(T_+/R_+)\,(z_\alpha - z_{1-\beta})^2\,\sigma_R^2/\sigma_A^2.$$

4A.4 Pair-difference design 

Under the pair-difference design, a sib pair is selected if the pair-difference JC is larger in 
20 magnitude than a threshold Xt, 

G„G 2 G„<7 2 

In the first term, sibling 1 has the higher phenotype and is selected for the upper pool, and 
sibling 2 is selected for the lower pool. In the second term, the roles of the siblings are 
reversed. Multinomial distributions are defined as 9iy(Gi,G 2 ), the genotype probabilities for 
25 sib pairs in which sibling 1 enters the upper pool, and Ql(G\,G 2 ), when sibling 1 enters the 
lower pool. For selected pairs, 
1- E {^G l7 G 2 ) +9z(G 1 ,G 2 )}. 

O li G 2 

This normalization implies that 

QiAGuGi) = (2p)- 1 P(G 1 ,G 2 ) ®{[\i-{GuG2)-XT]/a-} and 
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Ql(G u G 2 ) = (2p)- 1 P(G 1 ,G 2 ) OH^ftWaL}. 

Due to symmetry, 0uiGuG 2 ) and B L (G 2y G\) are identical. The expected allele frequency 
difference between pools is 

Ei(4p)«£ 2Qu(GuG2)p-(GnG2)- £ 2$ L {G u G 2 )p-{G x ,G 2 y, 

5 by symmetry, each term contributes E(Ap)/2. To calculate the variance of 6p, it is important 
to note that the normalization of Gy and Qj, to 1/2 implies that the probabilities for a 
multinomial distribution are 2Bu and 2Q L , with both 6u and 0 Z equal to P(G U G2)I2 under the 
null hypothesis. The terms giving the variance under the null hypothesis and the alternative 
hypothesis are 

10 gq 2 = 2s P(G u G 2 )p- 2 = 2sJLo p 2 - o> 2 and 

C„<7 2 

a! 2 = 2 J] [2Qu(Gi ,G» + 20 £ (G I ,G 2 )>. 2 - E(Ap) 2 . 

The value of ao 2 under the null hypothesis may be obtained more simply by noting that the 
allele frequency difference between two siblings has variance 07, 2 , and the measured allele 
frequency difference is the mean of n such terms. 

The repository size required to detect association may be determined exactly by numeric
calculation of the threshold value $X_T$ as a function of the pooling fraction $\rho$. This value is then
used to determine $E(\Delta p)$, $\sigma_0^2$, and $\sigma_1^2$.

An analytic expression accurate when $\sigma_R^2$ is close to 1 may be derived using the same
technique as for the previous pooling designs. The analytic estimate for the threshold value is
$$X_T = -\sigma_- \Phi^{-1}(\rho),$$

and the allele frequency difference is

$$E(\Delta p) = 2y_\rho \Big[ \sum_{G_1,G_2} P(G_1,G_2)\,\mu_-(G_1,G_2)\,p_-(G_1,G_2) \Big] \Big/ \rho\sigma_- = (2y_\rho/\rho)\,(R_-/T_-^{1/2})\,\sigma_p\,\sigma_A/\sigma_R,$$

where $y_\rho$ is the height of the standard normal density at $\Phi^{-1}(\rho)$. The variance term $\sigma_1^2$ equals
$\sigma_0^2$ to this order of approximation, and the repository size required by the pair-difference
design is

$$N_{\text{pair-diff}} = (s\rho/2y_\rho^2)\,(T_-/R_-)\,(z_\alpha - z_{1-\beta})^2\,\sigma_R^2/\sigma_A^2.$$

Example 4.1 Comparisons with individual genotyping

When the effect of a QTL is small and the residual variance $\sigma_R^2$ is close to 1, the analytic
expressions for repository size requirements are exact. In this limit, we begin by comparing
the efficiency of pooled DNA designs relative to individual genotyping.

The repository size requirements of pooled DNA methods are shown in Fig. 9 relative to the 
corresponding regression tests for the same family structure. Methods plotted are the 
unrelated, pair-mean, pair-difference, and combined designs, as well as the Mahalanobis 
design. Except for the Mahalanobis design, the ratio of repository size requirements is 
independent of all model parameters except for the fraction p of individuals whose DNA is 
pooled. Furthermore, the ratio is independent of family structure for these matched 
comparisons. The optimal pooling fraction is $\rho = 0.27$. The curves are flat near the minimum,
indicating that pooling fractions close to the optimum give near-optimal results. Repository
sizes must be increased by a factor of 1.24 to attain the same power as would have been achieved with
$N$ individual genotypes.
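
The location of this optimum can be illustrated with a short numerical sketch. It assumes, as a simplification not stated explicitly in this passage, that the pooled-to-individual repository size ratio takes the form $\rho/(2y_\rho^2)$, where $y_\rho$ is the standard normal density at $\Phi^{-1}(\rho)$; the function names are hypothetical. Under that assumption the minimum falls close to the quoted pooling fraction and size ratio.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def pooled_to_individual_ratio(rho: float) -> float:
    """Assumed ratio N_pooled / N_individual = rho / (2 * y_rho**2)."""
    # y_rho: height of the standard normal density at the selection threshold
    y_rho = STD_NORMAL.pdf(STD_NORMAL.inv_cdf(rho))
    return rho / (2.0 * y_rho ** 2)


def optimal_pooling_fraction(lo: float = 0.05, hi: float = 0.5,
                             tol: float = 1e-6) -> float:
    """Golden-section search for the pooling fraction minimizing the ratio."""
    inv_phi = (5 ** 0.5 - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if pooled_to_individual_ratio(c) < pooled_to_individual_ratio(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2


rho_opt = optimal_pooling_fraction()
print(f"optimal pooling fraction: {rho_opt:.4f}")
print(f"repository size ratio:    {pooled_to_individual_ratio(rho_opt):.3f}")
```

The flatness of the curve near the minimum can be checked by evaluating `pooled_to_individual_ratio` at nearby fractions such as 0.20 or 0.35.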

The repository size required for the Mahalanobis design is shown relative to that required for
the combined regression test. This ratio depends on the residual phenotypic correlation $t_R$
between siblings, and a typical value $t_R = 0.6$ has been selected for illustrative purposes. The
minimum at $\rho = 0.188$ is independent of $t_R$, and the repository must be 1.55 times larger than that for a
genotyping study.

In Fig. 10, the performance of the Mahalanobis design relative to the combined regression test
for individual genotypes is shown as a function of the residual sibling phenotypic correlation
$t_R$, with the optimal fraction 0.188 used to construct the upper and lower pools. The ratio of
repository sizes is roughly 1.5 until the phenotypic correlation rises above 0.6, at which point
the repository size requirements for the Mahalanobis design begin to rise more steeply.

Example 4.2 Comparisons between unrelated and sib-pair populations

In Fig. 11, the repository size requirements for association tests using DNA pooled from sib
pairs are shown as a function of the residual sibling phenotypic correlation $t_R$, relative to the
repository size required for a test of DNA pooled from unrelated individuals. Ratios larger
than 1 indicate that the population of $N$ unrelated individuals is more powerful than a
population of $N/2$ sib pairs, while ratios smaller than 1 indicate that the sib-pair population is
more powerful. These ratios are obtained from the analytical approximations derived for
complex traits.

For designs using only 2 pools, a population of unrelated individuals is more powerful than a
population of sib pairs except for large values of the sibling phenotypic correlation, $t_R > 0.75$,
at which point the Mahalanobis and pair-difference designs become more powerful. Below
this phenotypic correlation, the Mahalanobis design is substantially more powerful than the
other sib-pair tests; above this correlation, the pair-difference design is only slightly more
powerful than the Mahalanobis design.

The slope of the pair-difference repository size requirement is 3 times larger than the slope of the
pair-mean population requirement. Thus, relative to the pair-mean design, the pair-difference
design decreases in power rapidly as $t_R$ falls below 0.5 and increases in power rapidly as $t_R$
rises above 0.5.

The combined 4-pool test using pair-mean and pair-difference pools is uniformly the most
powerful sib-pair design for all values of $t_R$. Its worst-case performance relative to an
unrelated population occurs when $t_R$ is $(3^{1/2}-1)/(3^{1/2}+1)$, or 0.2679, where it requires a
population 7% larger. The unrelated and sib-pair tests require the same repository size when
the phenotypic correlation is 0.5, and the sib-pair test becomes much more powerful for equal
repository sizes for larger values of $t_R$.

Example 4.3 Sensitivity to QTL effect size, allele frequency, and inheritance mode


According to the analytic theory, the necessary size of the study population for pooling tests is
inversely proportional to the additive variance contributed by the QTL relative to the residual
phenotypic variance, $\sigma_A^2/\sigma_R^2$, and independent of any remaining parameters of the genetic
model. Here we provide exact numerical results to assess the region of validity for the
analytical approximations. For these numerical results, the type I error rate $\alpha$ is $5\times10^{-8}$ and
the type II error rate $\beta$ is 0.2 to provide adequate power and an acceptable number of false
positives for a whole-genome scan. For consistency in Figs. 12-14, the unrelated-population
design is a dotted line, Mahalanobis is a thin line, pair-mean is dashed, pair-difference is
dot-dashed, and the combined estimator sib-combined is a thick line.

A single representative value for the sibling phenotypic correlation $t_R$ was selected for these
tests. This correlation is equal to half the genetic heritability plus the shared environmental
contribution to the total variance of a complex trait. For cancer, heritability has been
estimated at 20% and environmental factors at 7% (Verkasalo et al., 1999); for systolic and
diastolic blood pressure, heritabilities are estimated at 10% to 50% and environmental factors
at 20% to 40% (Iselius et al., 1983; Perusse et al., 1989); heritability for cholesterol level is
estimated at 70% to 90% (Austin et al., 1987) and environmental factors for serum lipids are
estimated at 15% (Heller DA, de Faire U, Pedersen NL, Dahlen G, McClearn GE: Genetic
and environmental influences on serum lipid levels in twins. N Engl J Med 1993; 328: 1150-6).
Additional heritability estimates are 20% to 40% for Type 2 diabetes mellitus (NIDDM)
[Watanabe RM, Valle T, Hauser ER, Ghosh S, Eriksson J, Kohtamaki K, Ehnholm C,
Tuomilehto J, Collins FS, Bergman RN, Boehnke M: Familiality of quantitative
metabolic traits in Finnish families with non-insulin-dependent diabetes mellitus. Finland-
United States Investigation of NIDDM Genetics (FUSION) Study investigators. Hum Hered
1999; 49: 159-168] and 50% for pulmonary function [Wilk JB, Djousse L, Arnett DK, Rich
SS, Province MA, Hunt SC, Crapo RO, Higgins M, Myers RH: Evidence for major genes
influencing pulmonary function in the NHLBI family heart study. Genet Epidemiol 2000; 19:
81-94]. These values suggest a range of 0.25 to 0.75 for $t_R$; we selected $t_R = 0.6$. Choosing a
different value of $t_R$ changes the relative power of different pooling designs, as shown in Fig.
11, but does not alter any conclusions regarding the validity of the analytic theory.

In Fig. 12, the ratio $\sigma_A^2/\sigma_R^2$ is varied over 3 orders of magnitude. The QTL has purely additive
inheritance and the minor allele frequency is 0.1. The pooling fraction has been optimized
numerically, and linearity in the log-log plot demonstrates validity of the analytic results.
Inspection of the results shows that agreement extends almost to $\sigma_A^2/\sigma_R^2 = 1$, where the QTL
is responsible for half the phenotypic variance, for all the designs except Mahalanobis. The
Mahalanobis design is less powerful than predicted by analytic theory for $\sigma_A^2/\sigma_R^2 > 0.05$. This
level of additive variance marks the onset of a major gene effect: carriers of the minor allele
separate into a clearly resolved affected population, and the association may be identified by
traditional family-based linkage analysis.

The allele frequency difference at the significance threshold, $z_\alpha \sigma_0/n^{1/2}$, is shown in Fig. 12B
for the same set of parameters. For the combined design, there are actually two frequency
differences, one for the pair-mean pools and another for the pair-difference pools. Only the
difference for the pair-difference pools is shown. As the QTL contribution becomes smaller,
allele frequency differences must be measured with greater precision. While raw frequency
differences of 10% are significant for a major gene ($\sigma_A^2/\sigma_R^2 \approx 0.1$), raw frequency
differences of 3% must be measured with little error to achieve maximum power for a
complex trait with $\sigma_A^2/\sigma_R^2 \approx 0.01$.

The sensitivity of the results to both the allele frequency $p$ and the inheritance mode is shown
in Figs. 13 and 14. In both of these figures, the pooling fractions are fixed at the limiting values
0.27 for the unrelated-population, pair-mean, pair-difference, and sib-combined designs and at
0.188 for the Mahalanobis design, as would presumably be done if DNA is pooled once
then used repeatedly in a genome-wide screen of markers. In Fig. 13, the allele frequency is
varied for a phenotype with dominant inheritance (Fig. 13A), additive inheritance (Fig. 13B),
and recessive inheritance (Fig. 13C) of the minor allele. The QTL contribution $\sigma_A^2/\sigma_R^2$ is held
fixed at 0.02 for these comparisons. The figures are shown only for the region $p < 0.5$ on a log
scale to highlight the behavior for small values of $p$; additive alleles are symmetric about $p$ =






WO 02/16643 PCT/US01/25924 

0.5, while dominant major alleles are equivalent to recessive minor alleles and vice versa. It is
important to note that the displacements $a$ and $d$ are increased to compensate for a smaller
allele frequency $p$ in order to keep $\sigma_A^2$ constant and ensure that the limiting behavior for a
QTL with small effect is a horizontal line. If the displacements had been held constant, then
$\sigma_A^2$ would decrease linearly with $p$ and the required repository size would increase as $1/p$.

The repository size is rather insensitive to allele frequency for $p > 0.01$ for dominant and
additive inheritance, and for $p > 0.2$ for recessive inheritance, for all but the Mahalanobis
design, indicating that the analytic theory is valid in these regions. The repository size
required to detect association increases rapidly as the allele frequency decreases below these
limits. The Mahalanobis design is more sensitive to the allele frequency than the other
designs, losing power rapidly as the allele frequency falls below 0.1 for dominant and additive
inheritance and 0.2 for recessive inheritance.

The allele frequency at which the analytic theory loses accuracy may be estimated by noting
that the perturbation parameters used to derive the theory are the terms $\mu(G)/\sigma_R$. As the
magnitude of these terms approaches 1, or equivalently when the displacements $a$ or $d$ become
close to 1, the perturbation theory becomes less reliable. This occurs for $p = \sigma_A^2/8$ under
dominant inheritance, $\sigma_A^2/2$ under additive inheritance, and $\sigma_A^{2/3}/2$ for recessive inheritance.
In Fig. 13, these values are 0.0025, 0.01, and 0.14, and accurately identify the elbows of the
repository size curves.
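
The elbow estimates can be checked with a small sketch. It assumes the additive variance expression $\sigma_A^2 = 2p(1-p)[a - d(2p-1)]^2$ from the variance components model, with $d = -a$, $0$, or $+a$ for recessive, additive, and dominant inheritance; the function names are hypothetical. The sketch solves for the displacement $a$ that holds $\sigma_A^2$ fixed and confirms that $a$ is close to 1 at each estimated elbow.

```python
import math

# Heterozygote displacement expressed as d = k * a for each inheritance mode
MODE_TO_K = {"recessive": -1.0, "additive": 0.0, "dominant": 1.0}


def additive_variance(p: float, a: float, d: float) -> float:
    """Additive variance 2p(1-p)[a - d(2p-1)]^2 for allele frequency p."""
    return 2.0 * p * (1.0 - p) * (a - d * (2.0 * p - 1.0)) ** 2


def displacement_for_fixed_variance(p: float, sigma_a2: float, mode: str) -> float:
    """Displacement a that keeps the additive variance equal to sigma_a2."""
    k = MODE_TO_K[mode]
    scale = 2.0 * p * (1.0 - p) * (1.0 - k * (2.0 * p - 1.0)) ** 2
    return math.sqrt(sigma_a2 / scale)


def elbow_frequency(sigma_a2: float, mode: str) -> float:
    """Small-p estimate of the allele frequency at which a approaches 1."""
    if mode == "dominant":
        return sigma_a2 / 8.0
    if mode == "additive":
        return sigma_a2 / 2.0
    if mode == "recessive":
        return sigma_a2 ** (1.0 / 3.0) / 2.0
    raise ValueError(f"unknown inheritance mode: {mode}")


for mode in ("dominant", "additive", "recessive"):
    p_elbow = elbow_frequency(0.02, mode)
    a_elbow = displacement_for_fixed_variance(p_elbow, 0.02, mode)
    print(f"{mode:9s} elbow p = {p_elbow:.4f}, displacement a = {a_elbow:.3f}")
```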

In Fig. 14, the mode of inheritance is varied while the allele frequency is held fixed at one of
three values, $p = 0.5$ (Fig. 14A), 0.25 (Fig. 14B), or 0.1 (Fig. 14C). When $p = 0.5$, the
inheritance mode has virtually no effect on the repository size required to detect association.
The Mahalanobis design is an exception, with increasing requirements only for strong
over-dominance. For $p < 0.5$, the additive variance necessarily vanishes at $d = a/(2p-1)$; when $d$ is
close to this value, the population requirements increase dramatically. For $p = 0.25$, this
occurs at $d = -2a$. Above this critical value of $d$, excess $A_1A_1$ homozygotes are detected in the
upper pool; below the critical value, excess $A_1A_2$ heterozygotes are detected in the lower pool.
Although $\Delta p$ is negative in this region and therefore not significant under a one-sided test of
allele $A_1$, a two-tailed test would yield a significant result. The repository size requirements
are substantially smaller than predicted by analytic theory for this region of strongly
over-dominant major alleles.

In the bottom panel, Fig. 14C, the allele frequency is $p = 0.1$ and the critical value of $d$ is
$a/(2p-1) = -1.25a$. The region of increased population requirements is narrower than in Fig. 14B, and
becomes narrower still when $p$ is further reduced, but the general behavior is the same.

Example 4.4 Dependence on type I and type II error rates 

We have also investigated the sensitivity of the exact numerical results to specified rates of
type I and type II error. In the analytical approximations, this behavior is described entirely by
the term $(z_\alpha - z_{1-\beta})^2$, and the optimal pooling fractions are independent of $\alpha$ and $\beta$. Comparison
with numerical results indicates that the analytical theory is accurate, with no differences seen
on the scale of the figures previously presented (results not shown). Using the $(z_\alpha - z_{1-\beta})^2$
scaling and specifying a fixed power of 0.8 ($z_{1-\beta} = -0.84$), for example, a whole-genome scan
with $\alpha = 5\times10^{-8}$ ($z_\alpha = 5.33$) requires 1.7 times more individuals than a test of 1000 candidate
markers with $\alpha = 5\times10^{-5}$ ($z_\alpha = 3.89$) and 6.2 times more individuals than a test of a single marker
with $\alpha = 0.05$ ($z_\alpha = 1.64$).
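
The scaling factors quoted above can be reproduced with a brief sketch using the standard normal quantile function. It follows the conventions $\alpha = 1 - \Phi(z_\alpha)$ and $1-\beta = \Phi(-z_{1-\beta})$ for a one-sided test; the function name is hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def error_rate_scale(alpha: float, power: float) -> float:
    """Return (z_alpha - z_{1-beta})^2 for a one-sided test."""
    z_alpha = -STD_NORMAL.inv_cdf(alpha)  # alpha = 1 - Phi(z_alpha)
    z_power = -STD_NORMAL.inv_cdf(power)  # 1 - beta = Phi(-z_{1-beta})
    return (z_alpha - z_power) ** 2


genome_scan = error_rate_scale(5e-8, 0.8)
candidate_markers = error_rate_scale(5e-5, 0.8)
single_marker = error_rate_scale(0.05, 0.8)

print(f"genome scan vs 1000 candidate markers: {genome_scan / candidate_markers:.2f}")
print(f"genome scan vs single marker:          {genome_scan / single_marker:.2f}")
```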

Example 4.5 Tests in the presence of population stratification 

A marker may show spurious association to a phenotype in the presence of a stratified 
population. We consider a simple model for stratification in which a population contains at 
least one sub-population having a mean marker frequency and a mean phenotypic value that 
both deviate from their respective means in the total population. In individual genotyping, 
within-family tests such as the transmission disequilibrium test are known to be robust to this 
type of stratification. Between-family tests, however, may identify spurious associations or 
miss true associations due to stratification effects. 

Tests of pooled DNA in which family members are balanced between pools, such as the
pair-difference design, are analogous to within-family tests. The value of $\sigma_A/\sigma_R$ estimated from this
test is robust to stratification effects. The remaining designs, in particular the pair-mean
design, do not balance family members and are subject to stratification. A suitable test for the
presence of stratification, therefore, is to compare the value of $\sigma_A/\sigma_R$ estimated separately from
the pair-difference and pair-mean pools with the combined estimator in the form of a $\chi^2$ test,

$$X^2 = \frac{[Q_+ - Q]^2}{[s\rho/2y_\rho^2 N]\,[T_+/R_+]} + \frac{[Q_- - Q]^2}{[s\rho/2y_\rho^2 N]\,[T_-/R_-]},$$
with one degree of freedom. This stratification estimator may also be expressed as
$$X^2 = \frac{[Q_+ - Q_-]^2}{[s\rho/2y_\rho^2 N]\,[T_+/R_+ + T_-/R_-]}.$$

A significant finding for this test, for example at the 0.05 level, indicates that stratification is 
present and that tests other than the pair-difference test may yield spurious results. 

Example 4.6 Allele frequency measurement error

The preceding analysis has assumed that allele frequency measurement errors are negligible.
Allele frequencies measured by most technologies, including PCR amplification [Shaw SH,
Carrasquillo MM, Kashuk C, Puffenberger EG, Chakravarti A: Allele frequency distributions
in pooled DNA samples: applications to mapping complex disease genes. Gen Res 1998; 8:
111-123], kinetic PCR [Germer S, Holland MJ, Higuchi R: High-throughput SNP allele-
frequency determination in pooled DNA samples by kinetic PCR. Gen Res 2000; 10: 258-
266], denaturing high performance liquid chromatography [Hoogendoorn B, Norton N,
Kirov G, Williams N, Hamshere ML, Spurlock G, Austin J, Stephens MK, Buckland PR,
Owen MJ, O'Donovan MC: Cheap, accurate and rapid allele frequency estimation of single
nucleotide polymorphisms by primer extension and DHPLC in DNA pools. Hum Gen 2000;
107: 488-493], single-strand conformation polymorphism [Sasaki T, Tahira T, Suzuki A,
Higasa K, Kukita Y, Baba S, Hayashi K: Precise estimation of allele frequencies of single-
nucleotide polymorphisms by a quantitative SSCP analysis of pooled DNA. Am J Hum Gen
2001; 68: 214-218], pyrophosphate sequencing [Alderborn A, Kristofferson A,
Hammerling U: Determination of single-nucleotide polymorphisms by real-time
pyrophosphate DNA sequencing. Genome Res 2000; 10: 1249-1258], and mass spectrometry
[Buetow KH, Edmonson M, MacDonald R, Clifford R, Yip P, Kelley J, Little DP,
Strausberg R, Koester H, Cantor CR, Braun A: High-throughput development and
characterization of a genomewide collection of gene-based single nucleotide polymorphism
markers by chip-based matrix-assisted laser desorption/ionization time-of-flight mass
spectrometry. Proc Nat Acad Sci USA 2001; 98: 581-584], are typically reported with
standard errors in the range of 0.01 to 0.02. Assuming a measurement error of 0.01 for $p_U$ and
$p_L$, the resulting error in the population mean estimated as $p = (p_U + p_L)/2$ is 0.007. The
measurement error in $p$ affects the calculated repository size $N$ primarily through the terms $\sigma_0^2$
and $\sigma_1^2$, which are proportional to $p(1-p)$. The relative error in $N$ is proportional to $0.007/p$,
less than 10% for minor alleles with frequency greater than 0.07.

The measurement error in $\Delta p$, however, has a more deleterious effect on the test power. Again
assuming a measurement error of 0.01 for each pool, the measurement error for $\Delta p$ is
larger, approximately 0.014. This error can eventually become larger than the sampling error
$\sigma_0^2/n$ for large values of $n$. In this case, the critical value of $\Delta p$ depends on the measurement
error, not the sampling error. For example, the magnitude of $\Delta p$ for a two-sided test with
significance at the 0.01 level and power 0.95 is $(z_{0.005} - z_{0.95}) \times 0.014$, or 0.059, using $z_{0.005} = 2.58$
and $z_{0.95} = -1.64$.

The allele frequency measurement error also sets a lower limit for the effect size that can be
detected with a pooled test. For example, using the analytical approximation for $\Delta p$ for
pair-mean pools derived in the Appendix,

$$E_1(\Delta p) = (2y_\rho/\rho)\,(R_+/T_+^{1/2})\,\sigma_p\,\sigma_A/\sigma_R \approx 2.6 \times (1+t_R)^{-1/2}\,p(1-p)\,|a - (2p-1)d| > 0.059,$$
where the optimized pooling fraction $\rho = 0.27$ is used and the residual variance $\sigma_R^2$ is
approximated as 1. For a typical phenotypic correlation between sibs, $t_R$ is 0.5, and the effect
size that can be detected is
$$|a - (2p-1)d| > 0.028 / p(1-p).$$

For additive inheritance and allele frequency of 0.5, the threshold phenotypic displacement $a$
is 0.11 and the corresponding additive variance is 0.0063. If the minor allele frequency is 0.1,
the threshold displacement $a$ is 0.31 and the corresponding additive variance is 0.017.

In the presence of population stratification, the pair-mean pools may give spurious results and
pair-difference pools are preferred. Using the expectation for $\Delta p$ derived in the Appendix for
pair-difference pools, we require that
$$E_1(\Delta p) = (2y_\rho/\rho)\,(R_-/T_-^{1/2})\,\sigma_p\,\sigma_A/\sigma_R \approx 0.86 \times (1-t_R)^{-1/2}\,p(1-p)\,|a - (2p-1)d| > 0.059,$$

where $\rho = 0.27$ and $\sigma_R^2 \approx 1$ as before. For a typical phenotypic correlation between sibs, $t_R$ =
0.5, the effect size that can be detected is
$$|a - (2p-1)d| > 0.049 / p(1-p).$$

For additive inheritance and an allele frequency of 0.5, the critical displacement is 0.20 and
the additive variance is 0.02. For a rare minor allele, $p = 0.1$, and additive inheritance, the
critical displacement is 0.54, corresponding to an additive variance of 0.05.
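
The detection limits in this example follow from simple arithmetic, sketched below for additive inheritance ($d = 0$). The coefficients 2.6 and 0.86 and the critical difference 0.059 are taken from the text; the function names and the `design` labels are hypothetical, and small discrepancies from the quoted figures reflect rounding of intermediate values.

```python
import math


def critical_delta_p(z_alpha: float, z_power: float, delta_p_error: float) -> float:
    """Critical allele frequency difference when measurement error dominates."""
    return (z_alpha - z_power) * delta_p_error


def detectable_displacement(p: float, t_r: float, design: str,
                            delta_p_crit: float = 0.059) -> float:
    """Smallest additive displacement a (d = 0) detectable for a given design."""
    if design == "pair-mean":
        coeff = 2.6 / math.sqrt(1.0 + t_r)
    elif design == "pair-difference":
        coeff = 0.86 / math.sqrt(1.0 - t_r)
    else:
        raise ValueError(f"unknown design: {design}")
    return delta_p_crit / (coeff * p * (1.0 - p))


def additive_variance(p: float, a: float) -> float:
    """Additive variance 2p(1-p)a^2 for purely additive inheritance."""
    return 2.0 * p * (1.0 - p) * a ** 2


print(f"critical delta p: {critical_delta_p(2.58, -1.64, 0.014):.3f}")
for design in ("pair-mean", "pair-difference"):
    for p in (0.5, 0.1):
        a = detectable_displacement(p, 0.5, design)
        print(f"{design:15s} p = {p}: a = {a:.2f}, variance = {additive_variance(p, a):.4f}")
```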


5. Model 3 

In this model, techniques similar to those described in Models 1 and 2 are applied to provide
optimized selection criteria for association studies of pooled DNA using the allele frequency
difference between pools as a test statistic. It is assumed that samples are drawn from a
pre-existing population-level DNA repository collected from individuals unselected for any
particular phenotype, and that each individual has been measured for a particular phenotype of
interest; the goal is to select pools to maximize the power of the test.

Assuming no experimental error in allele frequency measurements on pooled DNA, we
determine the selection thresholds that maximize the power to detect association as a function
of the frequency, phenotypic displacement, and inheritance mode of a functional
polymorphism. The genetic parameters are also described in terms of a genotype relative risk
model. Power calculations are then used to derive the repository size required to detect
association at specified false-positive and false-negative rates. These calculations are
performed at three decreasing levels of accuracy: exact numerical calculations using the true
multinomial distribution of the test statistic; numerical calculations based on an approximate
normal distribution of the test statistic; and analytical approximations accurate for complex
traits where the polymorphism has a small effect on the phenotype.

Results are depicted in terms of the repository sizes required for three types of experimental
designs for detecting association with a quantitative phenotype: first, a pooled DNA test using
a conventional affected/unaffected classification; second, a pooled DNA test of extreme
individuals using optimized selection thresholds; third, individual genotyping of the entire
population. We conclude with a discussion of the reduction in power of pooled DNA tests
due to experimental measurement error and with suggestions for effective use of pooled DNA
tests in practice.

5.1 Computational Methods

The calculation of optimized selection thresholds begins with a model for the
genotype-dependent distribution of phenotypic values. A quantitative phenotype, denoted $X$, is
standardized to have unit variance and zero mean. The phenotype is hypothesized to be
affected by alleles $A_1$ and $A_2$, with frequencies $p$ and $1-p$ respectively, at a particular QTL.
The population frequencies $P(G)$ for genotypes $G = A_1A_1$, $A_1A_2$, and $A_2A_2$ are assumed to obey
Hardy-Weinberg equilibrium. Using standard notation for a variance components model, the
effect $\mu_G$ of genotype $G$ on phenotype $X$ is $a$, $d$, and $-a$ for genotypes $A_1A_1$, $A_1A_2$, and $A_2A_2$
respectively. These displacements are each offset by subtracting $(2p-1)a + 2p(1-p)d$ to
preserve the overall phenotype mean of zero.

The inheritance mode of the QTL is represented by the displacement $d$ of the heterozygote, for
example purely recessive ($d = -a$), additive ($d = 0$), or dominant ($d = +a$) inheritance. The
inheritance mode partitions the phenotypic variance due to the QTL into the additive variance
$\sigma_A^2$ and the dominance variance $\sigma_D^2$, with
$$\sigma_A^2 + \sigma_D^2 = 2p(1-p)[a - d(2p-1)]^2 + 4p^2(1-p)^2 d^2.$$

This partitioning is important because, as will be seen below, pooled tests are sensitive
primarily to the additive component of variance. Note that the additive component may be
large even when the inheritance is purely dominant or recessive. The contributions to the
phenotype from remaining genetic and environmental factors are assumed to follow a normal
distribution with residual variance $\sigma_R^2$,

$$\sigma_R^2 = 1 - (\sigma_A^2 + \sigma_D^2).$$
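
As a consistency check on the model just described, the following sketch builds the offset genotype effects and confirms numerically that they have zero mean and that their variance equals $\sigma_A^2 + \sigma_D^2$. The function names and the example parameter values are illustrative assumptions, not part of the original analysis.

```python
def genotype_model(p: float, a: float, d: float):
    """Hardy-Weinberg genotype frequencies and offset genotype effects mu_G."""
    freqs = {"A1A1": p * p, "A1A2": 2.0 * p * (1.0 - p), "A2A2": (1.0 - p) ** 2}
    offset = (2.0 * p - 1.0) * a + 2.0 * p * (1.0 - p) * d
    means = {"A1A1": a - offset, "A1A2": d - offset, "A2A2": -a - offset}
    return freqs, means


def variance_components(p: float, a: float, d: float):
    """Additive, dominance, and residual variance for a unit-variance phenotype."""
    sigma_a2 = 2.0 * p * (1.0 - p) * (a - d * (2.0 * p - 1.0)) ** 2
    sigma_d2 = 4.0 * p ** 2 * (1.0 - p) ** 2 * d ** 2
    sigma_r2 = 1.0 - (sigma_a2 + sigma_d2)
    return sigma_a2, sigma_d2, sigma_r2


# Example: dominant inheritance (d = +a) with a rare allele (illustrative values)
freqs, means = genotype_model(p=0.1, a=0.3, d=0.3)
sigma_a2, sigma_d2, sigma_r2 = variance_components(p=0.1, a=0.3, d=0.3)

qtl_mean = sum(freqs[g] * means[g] for g in freqs)
qtl_variance = sum(freqs[g] * means[g] ** 2 for g in freqs)
print(f"mean of genotype effects: {qtl_mean:.2e}")
print(f"variance of effects: {qtl_variance:.5f}, sigma_A^2 + sigma_D^2: {sigma_a2 + sigma_d2:.5f}")
print(f"residual variance: {sigma_r2:.5f}")
```

In this example most of the QTL variance is additive even though the inheritance is purely dominant, in line with the remark above.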

The genotype-dependent phenotype distributions for each genotype are
$$P(X|G) = (2\pi\sigma_R^2)^{-1/2} \exp[-(X-\mu_G)^2/2\sigma_R^2],$$

normal distributions centered at $\mu_G$ with width $\sigma_R$. The overall phenotype distribution is the
weighted sum of the distributions from each genotype,
$$P(X) = \sum_G P(X|G)\,P(G).$$

For a complex trait in which the QTL makes a small contribution, the three underlying
distributions may be unresolved in the observed $P(X)$.

This variance components model may be connected to an equivalent affected/unaffected
genotype relative risk model by specifying a threshold phenotypic value $X_T$ that classifies
individuals as affected ($X > X_T$) or unaffected ($X < X_T$). The proportion $r$ of the total
population that is affected is the overall risk or disease prevalence; the probability that an
individual with genotype $G$ is affected, divided by the corresponding probability for an
individual with genotype $A_2A_2$, is the genotype relative risk.

In the tests of pooled DNA considered here, a sample repository of total size $N$ serves as the
source of DNA to be selected for one of two pools; not every individual need be selected. The
test statistic is the difference between the two pools in the frequency of a particular allele, here
always assumed to be $A_1$. For a quantitative phenotype, it is natural to specify an upper
threshold $X_U$ and a lower threshold $X_L$ as the selection criteria. Individuals with phenotypic
values above $X_U$ are selected for the upper pool; individuals with phenotypic values below $X_L$
are selected for the lower pool; and individuals with phenotypic values between $X_L$ and $X_U$ are
not pooled at all. The number of individuals selected for each pool is $\rho N$. The fraction $\rho$
expressed in terms of $X_U$ is
$$\rho = \sum_G \Phi[-(X_U - \mu_G)/\sigma_R]\,P(G),$$
which is solved numerically to determine $X_U$. The genotypes of individuals selected by $X > X_U$
follow a multinomial distribution; the probability $\theta_U(G)$ that an individual selected for this
pool has genotype $G$ is $\Phi[-(X_U - \mu_G)/\sigma_R]\,P(G)/\rho$. A multinomial distribution is defined
similarly for the lower pool,
$$1 = \sum_G \theta_L(G) = \rho^{-1} \sum_G \Phi[(X_L - \mu_G)/\sigma_R]\,P(G),$$
using the lower threshold $X_L$.

A pooling design based on an affected/unaffected classification is similar: affected individuals
are selected for the upper pool; an equivalent number of suitably matched unaffected
individuals are selected for the lower pool. The selection thresholds $X_U$ and $X_L$ are identical to
the classification threshold $X_T$. The relative risk for genotype $G$, expressed in terms of the
pooling threshold, is $[\theta_U(G)/P(G)] / [\theta_U(A_2A_2)/P(A_2A_2)]$.
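
The threshold search and the resulting genotype relative risks can be sketched as follows. This is a minimal illustration using bisection; the function names and the example genotype frequencies, effects, and residual standard deviation are assumptions chosen for the example (additive inheritance with $p = 0.25$ and $a = 0.2$), not values from the original analysis.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cumulative distribution


def upper_tail_fraction(x_u, freqs, means, sigma_r):
    """Fraction of the population with phenotype above x_u."""
    return sum(freqs[g] * PHI(-(x_u - means[g]) / sigma_r) for g in freqs)


def solve_upper_threshold(rho, freqs, means, sigma_r, lo=-10.0, hi=10.0, tol=1e-10):
    """Bisection for x_u; the tail fraction decreases monotonically in x_u."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if upper_tail_fraction(mid, freqs, means, sigma_r) > rho:
            lo = mid  # too many individuals selected, raise the threshold
        else:
            hi = mid
    return 0.5 * (lo + hi)


def upper_pool_genotype_probs(x_u, rho, freqs, means, sigma_r):
    """theta_U(G): genotype probabilities among individuals above x_u."""
    return {g: freqs[g] * PHI(-(x_u - means[g]) / sigma_r) / rho for g in freqs}


def genotype_relative_risk(theta_u, freqs, reference="A2A2"):
    """[theta_U(G)/P(G)] / [theta_U(ref)/P(ref)] for each genotype."""
    ref = theta_u[reference] / freqs[reference]
    return {g: (theta_u[g] / freqs[g]) / ref for g in theta_u}


# Illustrative model: p = 0.25, a = 0.2, d = 0, effects offset to zero mean
freqs = {"A1A1": 0.0625, "A1A2": 0.375, "A2A2": 0.5625}
means = {"A1A1": 0.3, "A1A2": 0.1, "A2A2": -0.1}
sigma_r = (1.0 - 0.015) ** 0.5  # sigma_A^2 = 2 * 0.25 * 0.75 * 0.2**2 = 0.015

rho = 0.27
x_u = solve_upper_threshold(rho, freqs, means, sigma_r)
theta_u = upper_pool_genotype_probs(x_u, rho, freqs, means, sigma_r)
grr = genotype_relative_risk(theta_u, freqs)
print(f"upper threshold: {x_u:.4f}")
print({g: round(v, 3) for g, v in grr.items()})
```

For the affected/unaffected design the same routine applies with the prevalence $r$ in place of $\rho$.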

The repository size $N$ required to detect association between genotype $G$ and either the
quantitative phenotype $X$ or the affected/unaffected classification depends on the desired type I
error rate $\alpha$ and type II error rate $\beta$, the chosen test statistic, and the experimental design, as
well as on the underlying genetic model. For a one-sided test of a single marker, $\alpha = 1 - \Phi(z_\alpha)$
and $1-\beta = \Phi(-z_{1-\beta})$, where $\Phi(z)$ is the cumulative probability distribution for standard normal
deviate $z$. For a genome scan, the values $\alpha = 5\times10^{-8}$ ($z_\alpha = 5.33$) and $1-\beta = 0.8$ ($z_{1-\beta} = -0.84$)
have been suggested. The null hypothesis is denoted $H_0$ with all $\mu_G$ equal to zero, and the
alternative hypothesis is denoted $H_1$ with at least one non-zero $\mu_G$.

An exact calculation of the repository size required to attain desired error rates for a specified
genetic model proceeds as follows. First, a value of the pooling fraction $\rho$ or the disease
prevalence $r$ is selected. A trial repository size $N$ is specified, with the number of individuals
$n$ selected per pool set to the integer part of $\rho N$ or $rN$. Next, the probability $P_0(i,j,k)$ of
selecting $i$ individuals with genotype $A_1A_1$, $j$ individuals with genotype $A_1A_2$, and $k$ individuals
with genotype $A_2A_2$, with $i+j+k$ equal to $n$, is tabulated using the multinomial distribution
$$P_0(i,j,k) = [n!/(i!\,j!\,k!)]\,P(A_1A_1)^i\,P(A_1A_2)^j\,P(A_2A_2)^k.$$
The frequency of allele $A_1$ for this pool composition is $(2i + j)/2n$. The probability that two
pools selected in this manner differ in frequency by at least $\Delta p$ is calculated as the sum of
$P_0(i,j,k)\,P_0(i',j',k')$ for all combinations of $i,j,k$ and $i',j',k'$ where
$$[2(i-i') + (j-j')]/2n \ge \Delta p.$$

Significance at level $\alpha$ is attained by increasing $\Delta p$ until this sum is less than or equal to $\alpha$. If
not even the maximum value $\Delta p = 1$ is sufficient for significance at level $\alpha$, then a larger value
of $N$ is selected for the current value of $\rho$ and the calculation begins anew. Otherwise,
multinomial probabilities for pool compositions are calculated under $H_1$ using

$$P_U(i,j,k) = [n!/(i!\,j!\,k!)]\,\theta_U(A_1A_1)^i\,\theta_U(A_1A_2)^j\,\theta_U(A_2A_2)^k$$

for the upper pool and a similar term $P_L(i',j',k')$, with $\theta_L$ replacing $\theta_U$, for the lower pool. The
probability that the allele frequency difference between the upper and lower pools is at least
$\Delta p$ is obtained as the sum of $P_U(i,j,k)\,P_L(i',j',k')$ for all compositions $i,j,k$ and $i',j',k'$ where
$[2(i-i') + (j-j')]/2n \ge \Delta p$. If this probability is greater than or equal to $1-\beta$, the current $N$ is feasible
for type I error $\alpha$ and type II error $\beta$ and a smaller value for $N$ is attempted. This process
continues until the smallest feasible $N$ is found.

For the affected/unaffected design, this procedure is followed for each value of $r$. For the tail
pool design, the smallest feasible value for $N$ is calculated as a function of $\rho$, and the entire
design is optimized by searching for the pooling fraction $\rho$ with the smallest feasible $N$.
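
The core of the exact calculation can be sketched for small pools as follows. The sketch tabulates the distribution of the $A_1$ allele count $2i + j$ in a pool by enumerating multinomial compositions, then finds the smallest allele-count difference that is significant under the null hypothesis. The function names are hypothetical, and the full search over $N$ and $\rho$ described above is omitted for brevity.

```python
from math import factorial


def allele_count_pmf(n, genotype_probs):
    """pmf[c] = P(2i + j = c) for a pool of n individuals.

    genotype_probs = (P(A1A1), P(A1A2), P(A2A2)) for one selected individual.
    """
    q11, q12, q22 = genotype_probs
    pmf = [0.0] * (2 * n + 1)
    for i in range(n + 1):
        for j in range(n - i + 1):
            k = n - i - j
            coef = factorial(n) // (factorial(i) * factorial(j) * factorial(k))
            pmf[2 * i + j] += coef * q11 ** i * q12 ** j * q22 ** k
    return pmf


def tail_probability(pmf_upper, pmf_lower, min_diff):
    """P(count_upper - count_lower >= min_diff) for two independent pools."""
    return sum(pu * pl
               for cu, pu in enumerate(pmf_upper)
               for cl, pl in enumerate(pmf_lower)
               if cu - cl >= min_diff)


def critical_count_difference(pmf_null, alpha):
    """Smallest allele-count difference with null tail probability <= alpha.

    Returns None when even the maximum difference is not significant,
    which signals that a larger pool is required.
    """
    max_count = len(pmf_null) - 1
    for diff in range(max_count + 1):
        if tail_probability(pmf_null, pmf_null, diff) <= alpha:
            return diff
    return None


# Null hypothesis: Hardy-Weinberg genotype probabilities with p = 0.3, n = 6
p, n = 0.3, 6
hwe = (p * p, 2 * p * (1 - p), (1 - p) ** 2)
null_pmf = allele_count_pmf(n, hwe)
crit = critical_count_difference(null_pmf, alpha=0.05)
print(f"critical allele-count difference: {crit} (delta p = {crit / (2 * n):.3f})")
```

The power under $H_1$ is then `tail_probability(pmf_u, pmf_l, crit)`, with `pmf_u` and `pmf_l` built from the pool genotype probabilities $\theta_U$ and $\theta_L$.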

When each pool contains a large number of individuals and many copies of each allele, the
distribution of allele frequencies for the pool approaches a normal distribution. The difference
in allele frequencies between pools, which continues to serve as the test statistic, approaches a
normal distribution as well. The pool sizes required to achieve specified error rates are
obtained accurately in this case by approximating the multinomial distributions of allele
frequencies as normal distributions. Under $H_0$, the mean of the test statistic is zero and the
variance is $\sigma_0^2/n = p(1-p)/n$, derived by noting that the variance of the frequency difference is
twice the variance of the mean for a single pool of $n$ individuals. The allele frequency
variance for an individual is $p(1-p)/2$, and averaging over the $n$ individuals reduces the
variance by the factor $n$.

Under $H_1$, the expected allele frequency difference $\Delta p$ is
$$\Delta p = p_U - p_L = \sum_G [\theta_U(G) - \theta_L(G)]\,p_G,$$

where the genotype-dependent allele frequency $p_G$ is 1 for $G = A_1A_1$, 0.5 for $A_1A_2$, and 0 for
$A_2A_2$. The variance is $\sigma_1^2/n$, where $\sigma_1^2$ is obtained from the multinomial distribution,
$$\sigma_1^2 = \sum_G [\theta_U(G) + \theta_L(G)]\,p_G^2 - (p_U^2 + p_L^2).$$

The repository size $N$ required for type I error $\alpha$ and power $1-\beta$ follows from the pool size $n = \rho N$, with
$$n = [z_\alpha \sigma_0 - z_{1-\beta}\sigma_1]^2 / \Delta p^2.$$

For tail pools, $\rho$ is then varied to find the smallest $N$.
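
The normal-approximation calculation can be sketched end to end as follows. The sketch solves for the tail thresholds by bisection, forms $\Delta p$, $\sigma_0$, and $\sigma_1$ as defined above, and converts the pool size $n$ to a repository size using $n = \rho N$. All function names are hypothetical, and the example parameters (additive QTL with $p = 0.5$ and $\sigma_A^2 = 0.01$) are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

ND = NormalDist()
ALLELE_FREQ = {"A1A1": 1.0, "A1A2": 0.5, "A2A2": 0.0}  # p_G for each genotype


def genotype_model(p, a, d):
    """Genotype frequencies, offset effects, and residual standard deviation."""
    freqs = {"A1A1": p * p, "A1A2": 2 * p * (1 - p), "A2A2": (1 - p) ** 2}
    offset = (2 * p - 1) * a + 2 * p * (1 - p) * d
    means = {"A1A1": a - offset, "A1A2": d - offset, "A2A2": -a - offset}
    sigma_a2 = 2 * p * (1 - p) * (a - d * (2 * p - 1)) ** 2
    sigma_d2 = 4 * p ** 2 * (1 - p) ** 2 * d ** 2
    return freqs, means, sqrt(1 - sigma_a2 - sigma_d2)


def find_root(f, lo, hi, tol=1e-12):
    """Bisection for a monotone function that changes sign on [lo, hi]."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def pool_genotype_probs(p, a, d, rho):
    """theta_U(G) and theta_L(G) for tail pools each holding a fraction rho."""
    freqs, means, s = genotype_model(p, a, d)
    x_u = find_root(lambda x: sum(freqs[g] * ND.cdf((means[g] - x) / s) for g in freqs) - rho, -10, 10)
    x_l = find_root(lambda x: sum(freqs[g] * ND.cdf((x - means[g]) / s) for g in freqs) - rho, -10, 10)
    theta_u = {g: freqs[g] * ND.cdf((means[g] - x_u) / s) / rho for g in freqs}
    theta_l = {g: freqs[g] * ND.cdf((x_l - means[g]) / s) / rho for g in freqs}
    return theta_u, theta_l


def repository_size(p, a, d, rho, alpha, power):
    """Repository size N = n / rho from the normal approximation."""
    theta_u, theta_l = pool_genotype_probs(p, a, d, rho)
    p_u = sum(theta_u[g] * ALLELE_FREQ[g] for g in theta_u)
    p_l = sum(theta_l[g] * ALLELE_FREQ[g] for g in theta_l)
    delta_p = p_u - p_l
    sigma0 = sqrt(p * (1 - p))
    sigma1 = sqrt(sum((theta_u[g] + theta_l[g]) * ALLELE_FREQ[g] ** 2 for g in theta_u)
                  - (p_u ** 2 + p_l ** 2))
    z_alpha = -ND.inv_cdf(alpha)   # alpha = 1 - Phi(z_alpha)
    z_power = -ND.inv_cdf(power)   # 1 - beta = Phi(-z_{1-beta})
    n = (z_alpha * sigma0 - z_power * sigma1) ** 2 / delta_p ** 2
    return n / rho


N_tail = repository_size(p=0.5, a=sqrt(0.02), d=0.0, rho=0.27, alpha=5e-8, power=0.8)
print(f"repository size for tail pools at rho = 0.27: {N_tail:.0f}")
```

Scanning `rho` over a grid and keeping the smallest result mirrors the optimization over the pooling fraction described above.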

The normal approximation underestimates the repository size requirement relative to the exact
results from the multinomial distribution. When the sum of the alleles in both pools is at least
60, the difference in repository sizes is no greater than 5%. We chose 60 alleles in both pools
as the criterion for switching from the multinomial to the normal calculation. Standard
algorithms were employed to perform the root search for $X_U$ and $X_L$, the optimization, and
the integration over the tail of a normal distribution.

In the regime of typical complex traits, the effect of any single QTL is small, the residual
variance $\sigma_R^2$ is nearly 1, and analytical results may be obtained by expanding $\Delta p$
to second order in the effect size $\mu_G$. This corresponds loosely to a perturbation theory for
probability distributions. The $\Delta p$ expansion in turn requires a Taylor series expansion
for $\Phi(z)$,

$$\Phi(z-\delta) = \Phi(z) - \delta\,(d/dz)\,\Phi(z) + (1/2)\,\delta^2\,(d/dz)^2\,\Phi(z),$$

truncated at second order. The first derivative is

$$(d/dz)\,(2\pi)^{-1/2} \int_{-\infty}^{z} dz'\,\exp(-z'^2/2) = (2\pi)^{-1/2} \exp(-z^2/2) \equiv y,$$

where $y$ is the height of the normal distribution at normal deviate $z$, and the second
derivative is

$$(d/dz)\,(2\pi)^{-1/2} \exp(-z^2/2) = -yz.$$

Summing these terms,

$$\Phi(z-\delta) = \Phi(z) - y\delta - (1/2)\,yz\,\delta^2 .$$
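
The accuracy of the truncated expansion is easy to check numerically. The sketch below compares the first- and second-order approximations of $\Phi(z-\delta)$ with the exact value; the function names are illustrative only.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def shifted_cdf_first_order(z, delta):
    """Phi(z - delta) truncated after the linear term."""
    return STD.cdf(z) - STD.pdf(z) * delta


def shifted_cdf_second_order(z, delta):
    """Phi(z - delta) truncated after the quadratic term: Phi(z) - y*delta - y*z*delta^2/2."""
    y = STD.pdf(z)  # height of the normal density at z
    return STD.cdf(z) - y * delta - 0.5 * y * z * delta ** 2
```

For a shift of $\delta = 0.1$ at $z = 0.612$, the second-order form is expected to agree with the exact value to better than $10^{-4}$, while the first-order form is off by about $10^{-3}$.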

Substituting this approximation into the expressions for $Q(G)$ using $\delta = \mu_G/\sigma_R$
and $z = \Phi^{-1}(1-\rho)$ yields for the tail design

$$p_U = p + \frac{y}{\rho\sigma_R} \Big\{ \sum_G P(G)\,p_G\,\mu_G \Big\} + \frac{y|z|}{2\rho\sigma_R^2} \Big\{ \sum_G P(G)\,p_G\,\mu_G^2 \Big\} \quad \text{and}$$

$$p_L = p - \frac{y}{\rho\sigma_R} \Big\{ \sum_G P(G)\,p_G\,\mu_G \Big\} + \frac{y|z|}{2\rho\sigma_R^2} \Big\{ \sum_G P(G)\,p_G\,\mu_G^2 \Big\} .$$

The corresponding expressions for the affected/unaffected pools, with $z = \Phi^{-1}(1-r)$, are

$$p_U = p + \frac{y}{r\sigma_R} \Big\{ \sum_G P(G)\,p_G\,\mu_G \Big\} + \frac{y|z|}{2r\sigma_R^2} \Big\{ \sum_G P(G)\,p_G\,\mu_G^2 \Big\} \quad \text{and}$$

$$p_L = p - \frac{y}{(1-r)\sigma_R} \Big\{ \sum_G P(G)\,p_G\,\mu_G \Big\} - \frac{y|z|}{2(1-r)\sigma_R^2} \Big\{ \sum_G P(G)\,p_G\,\mu_G^2 \Big\} .$$

The required sums are

$$\sum_G P(G)\,p_G\,\mu_G = \sigma_A \left[ p(1-p)/2 \right]^{1/2}, \quad \text{and}$$

$$\sum_G P(G)\,p_G\,\mu_G^2 = (1/2)(1-\sigma_R^2) - 4p^2(1-p)^2\,ad + (2p-1)\,\sigma_D^2/2 \approx \sigma_A^2/2 .$$

The approximate value $\sigma_A^2/2$ for the second sum neglects the dominance variance and is
exact for purely additive inheritance. It serves to simplify the final equations for $\Delta p$.
Little error is made in the resulting $\Delta p$ for two reasons: first, even with dominant or
recessive inheritance, the additive variance is often larger than the dominance variance;
second, this factor is part of a correction term that is already small.
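
The two sums can be verified by direct enumeration over the three genotypes. The sketch below assumes Hardy-Weinberg genotype frequencies, genotype displacements $+a$, $d$, $-a$ centred on the population mean, unit total phenotypic variance, and the standard quantitative-genetics expressions $\sigma_A^2 = 2p(1-p)[a + d(1-2p)]^2$ and $\sigma_D^2 = [2p(1-p)d]^2$; none of these formulas is quoted from the text, and the function name is hypothetical.

```python
def genotype_sums(p, a, d):
    """Return the two sums by enumeration together with their closed forms."""
    q = 1.0 - p
    probs = [p * p, 2 * p * q, q * q]   # P(G) for A1A1, A1A2, A2A2 (assumed HWE)
    p_g = [1.0, 0.5, 0.0]               # allele frequency of each genotype
    raw = [a, d, -a]
    mean = sum(P * m for P, m in zip(probs, raw))
    mu = [m - mean for m in raw]        # centred genotype means mu_G

    sum1 = sum(P * g * m for P, g, m in zip(probs, p_g, mu))
    sum2 = sum(P * g * m * m for P, g, m in zip(probs, p_g, mu))

    var_a = 2 * p * q * (a + d * (q - p)) ** 2          # additive variance (assumed form)
    var_d = (2 * p * q * d) ** 2                        # dominance variance (assumed form)
    var_r = 1.0 - sum(P * m * m for P, m in zip(probs, mu))  # residual variance

    closed1 = var_a ** 0.5 * (p * q / 2) ** 0.5
    closed2 = 0.5 * (1 - var_r) - 4 * p ** 2 * q ** 2 * a * d + (2 * p - 1) * var_d / 2
    return sum1, closed1, sum2, closed2, var_a / 2
```

The closed form for the first sum holds when the average allele effect $a + d(1-2p)$ is positive; for additive inheritance ($d = 0$) the second sum reduces exactly to $\sigma_A^2/2$.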

The results for $\Delta p$ are

$$\Delta p = \frac{2^{1/2}\, y\, \sigma_0 \sigma_A}{\rho\,\sigma_R}, \quad \text{tail pools, and}$$

$$\Delta p = \left[ 1 + \frac{\Phi^{-1}(1-r)\,\sigma_A}{2^{3/2}\,\sigma_0 \sigma_R} \right] \frac{y\,\sigma_0 \sigma_A}{2^{1/2}\, r(1-r)\,\sigma_R}, \quad \text{affected/unaffected pools.}$$

To the same order of approximation, $\sigma_1^2$ may be equated with $\sigma_0^2$, and the number
of individuals required per pool is

$$n = \left[ z_\alpha - z_{1-\beta} \right]^2 \sigma_0^2 / \Delta p^2 .$$

The preceding three equations lead directly to our main results, Eqs. 1 and 2.
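
As a rough numerical illustration of these three expressions, the sketch below evaluates $\Delta p$ for both designs and the resulting pool size. It assumes purely additive inheritance, so that $\sigma_R^2 = 1 - \sigma_A^2$, and $\sigma_0^2 = p(1-p)$ as above; the function names are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def _sigmas(p, var_a):
    """Return (sigma_0, sigma_A, sigma_R), assuming no dominance variance."""
    return (p * (1 - p)) ** 0.5, var_a ** 0.5, (1 - var_a) ** 0.5


def delta_p_tail(p, var_a, rho):
    """Expected allele frequency difference between tail pools."""
    s0, sa, sr = _sigmas(p, var_a)
    y = STD.pdf(STD.inv_cdf(rho))  # density height at the pool threshold
    return 2 ** 0.5 * y * s0 * sa / (rho * sr)


def delta_p_affected(p, var_a, r):
    """Expected allele frequency difference between affected and unaffected pools."""
    s0, sa, sr = _sigmas(p, var_a)
    z = STD.inv_cdf(1 - r)
    y = STD.pdf(z)
    correction = 1 + z * sa / (2 ** 1.5 * s0 * sr)
    return correction * y * s0 * sa / (2 ** 0.5 * r * (1 - r) * sr)


def pool_size(p, delta_p, alpha=5e-8, beta=0.2):
    """Individuals per pool, n = [z_alpha - z_{1-beta}]^2 sigma_0^2 / delta_p^2."""
    z_a = -STD.inv_cdf(alpha)  # upper-tail deviate for the type I error
    z_b = STD.inv_cdf(beta)    # z_{1-beta}; negative when beta < 0.5
    return (z_a - z_b) ** 2 * p * (1 - p) / delta_p ** 2
```

With $p = 0.1$, $\sigma_A^2 = 0.01$, and $\rho = 27.03\%$, the expected difference is about 0.052 and the pool size about 1260 individuals.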

The perturbation theory above is valid when the expansion parameters $\mu_G/\sigma_R$ are small,
typically satisfied when $\sigma_A^2/2p(1-p)$ is smaller than 1. In this regime, approximate
genotype relative risks may be obtained from the Taylor series expansion for $Q(G)$. To lowest
order, the relative risk for the heterozygote is $1 + (d+a)y/r\sigma_R$, and for the $A_1A_1$
homozygote is $1 + 2ay/r\sigma_R$. For additive inheritance, $d = 0$, and the relative risk is
multiplicative with allele dose when $ay/r\sigma_R$ is small.

If individual genotypes are measured for the $N$ individuals in the population, the regression
coefficient $b_1$ in the regression model

$$X = b_1 (p_G - p) + \varepsilon$$

is a suitable test statistic. The residual contribution $\varepsilon$ to the phenotype has mean
zero and is uncorrelated with $p_G$. Under $H_0$, $b_1$ has mean zero and variance

$$\mathrm{Var}(b_1|H_0) = N^{-1}\,\mathrm{Var}(X)/\mathrm{Var}(p_G) = 1/N[p(1-p)/2] .$$

Under $H_1$, the expected value and the variance of $b_1$ are

$$E(b_1|H_1) = \mathrm{Cov}(X,p_G)/\mathrm{Var}(p_G) = \sigma_A/[p(1-p)/2]^{1/2} \quad \text{and}$$

$$\mathrm{Var}(b_1|H_1) = N^{-1}\,\mathrm{Var}(\varepsilon)/\mathrm{Var}(p_G) = \sigma_R^2/N[p(1-p)/2] .$$

The repository size required for a one-sided test of $b_1$ with Type I error $\alpha$ and power
$1-\beta$ is

$$N = \left[ z_\alpha \mathrm{Var}(b_1|H_0)^{1/2} - z_{1-\beta} \mathrm{Var}(b_1|H_1)^{1/2} \right]^2 / E(b_1|H_1)^2 ,$$

with the variances on the right-hand side evaluated at $N = 1$,
which is presented in simplified form as Eq. 3.
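
The reduction of the regression result to its simplified form can be checked numerically. The sketch below assumes unit phenotypic variance, $\sigma_R^2 = 1 - \sigma_A^2$, and per-individual variances (the factor $N^{-1}$ removed); function names are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def n_individual_from_moments(p, var_a, alpha=5e-8, beta=0.2):
    """Repository size from the moments of the regression coefficient b1."""
    var_pg = p * (1 - p) / 2             # Var(p_G)
    mean_h1 = var_a ** 0.5 / var_pg ** 0.5   # E(b1 | H1)
    var_h0 = 1.0 / var_pg                # N * Var(b1 | H0), with Var(X) = 1
    var_h1 = (1 - var_a) / var_pg        # N * Var(b1 | H1)
    z_a = -STD.inv_cdf(alpha)            # upper-tail deviate for the type I error
    z_b = STD.inv_cdf(beta)              # z_{1-beta}; negative when beta < 0.5
    return (z_a * var_h0 ** 0.5 - z_b * var_h1 ** 0.5) ** 2 / mean_h1 ** 2


def n_individual_simplified(var_a, alpha=5e-8, beta=0.2):
    """Simplified form: [z_alpha - z_{1-beta} sigma_R]^2 / sigma_A^2."""
    z_a = -STD.inv_cdf(alpha)
    z_b = STD.inv_cdf(beta)
    return (z_a - z_b * (1 - var_a) ** 0.5) ** 2 / var_a
```

The allele frequency cancels between numerator and denominator, so both functions return the same value for any $p$; for $\sigma_A^2 = 0.01$ the result is about 3800 individuals.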

Example 5.1 

Two experimental designs are considered using DNA pooled from individuals selected from a
pre-existing repository of $N$ samples: affected/unaffected pools, with DNA pooled from $n$
affected and $n$ unaffected individuals; and tail pools, with DNA pooled from the $n$ most
extreme individuals at each tail of the phenotype distribution.

For the affected/unaffected design, the expected number of affected individuals is $n = rN$, and
an additional $n$ suitably matched controls are selected from the remainder of the population.
An analytical approximation for the repository size is

$$N_{\mathrm{aff/unaff}} = \left[ z_\alpha - z_{1-\beta} \right]^2 \left[ \sigma_R^2/\sigma_A^2 \right] \cdot \frac{2r(1-r)^2}{y_r^2 \left[ 1 + \dfrac{\Phi^{-1}(1-r)\,\sigma_A}{2^{3/2}\,\sigma_R\,[p(1-p)]^{1/2}} \right]^2} , \qquad \text{(Eq. 1)}$$

where $y_r$ is the height of the standard normal distribution at $\Phi^{-1}(r)$ (see Materials
and Methods for derivation). Repository size requirements are minimized with a prevalence of
50%, much larger than values realistic for complex disorders.
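
Eq. 1 can be evaluated directly to see how strongly the repository size depends on prevalence. This is a sketch under the assumption of purely additive inheritance ($\sigma_R^2 = 1 - \sigma_A^2$); the function name `n_affected_unaffected` is hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def n_affected_unaffected(p, var_a, r, alpha=5e-8, beta=0.2):
    """Repository size for affected/unaffected pools from Eq. 1."""
    sigma_a = var_a ** 0.5
    sigma_r = (1 - var_a) ** 0.5         # assumes no dominance variance
    sigma_0 = (p * (1 - p)) ** 0.5
    z_a = -STD.inv_cdf(alpha)            # upper-tail deviate for the type I error
    z_b = STD.inv_cdf(beta)              # z_{1-beta}; negative when beta < 0.5
    y_r = STD.pdf(STD.inv_cdf(r))        # density height at the prevalence threshold
    correction = 1 + STD.inv_cdf(1 - r) * sigma_a / (2 ** 1.5 * sigma_r * sigma_0)
    geometry = 2 * r * (1 - r) ** 2 / (y_r ** 2 * correction ** 2)
    return (z_a - z_b) ** 2 * (sigma_r ** 2 / var_a) * geometry
```

For $p = 0.1$ and $\sigma_A^2 = 0.01$, a prevalence of 10% requires roughly 15,000 individuals, compared with about 5,900 at 50% prevalence.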

The tail pools are parameterized by the fraction $\rho = n/N$ of population $N$ selected for each
pool. An analytical approximation for the repository size is

$$N_{\mathrm{tail}} = \left[ z_\alpha - z_{1-\beta} \right]^2 \left[ \sigma_R^2/\sigma_A^2 \right] \cdot \rho/2y_\rho^2 , \qquad \text{(Eq. 2)}$$

where $y_\rho$ is the height of the standard normal distribution at $\Phi^{-1}(\rho)$ (see
Materials and Methods for derivation). The design is optimized by selecting $\rho$ to minimize
$\rho/2y_\rho^2$ and hence $N_{\mathrm{tail}}$.
The optimal fraction, 27.03%, is independent of all remaining parameters.
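
The optimal pooling fraction can be recovered by minimizing $\rho/2y_\rho^2$ numerically. The sketch below uses a golden-section search, which is one of several suitable one-dimensional minimizers and is not necessarily the algorithm used by the authors; the function names are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def tail_factor(rho):
    """Design-dependent factor rho / (2 * y_rho^2) appearing in Eq. 2."""
    y = STD.pdf(STD.inv_cdf(rho))
    return rho / (2 * y * y)


def optimal_rho(lo=0.01, hi=0.5, tol=1e-7):
    """Golden-section search for the pooling fraction minimizing tail_factor."""
    g = (5 ** 0.5 - 1) / 2
    while hi - lo > tol:
        c = hi - g * (hi - lo)
        d = lo + g * (hi - lo)
        if tail_factor(c) < tail_factor(d):
            hi = d   # minimum lies in [lo, d]
        else:
            lo = c   # minimum lies in [c, hi]
    return 0.5 * (lo + hi)


def n_tail(var_a, rho, alpha=5e-8, beta=0.2):
    """Repository size for tail pools from Eq. 2, assuming sigma_R^2 = 1 - sigma_A^2."""
    z_a = -STD.inv_cdf(alpha)
    z_b = STD.inv_cdf(beta)
    return (z_a - z_b) ** 2 * ((1 - var_a) / var_a) * tail_factor(rho)
```

The search should return a fraction near 27.03%, and `n_tail(0.01, optimal_rho())` gives a repository of roughly 4650 individuals for an additive variance of 1%.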

The repository size required to achieve the same error rates using individual genotyping is

$$N_{\mathrm{indiv}} = \left[ z_\alpha - z_{1-\beta}\sigma_R \right]^2 / \sigma_A^2 , \qquad \text{(Eq. 3)}$$

based on a regression model of phenotypic value on allele dose (see Materials and Methods for
derivation).

Results of the analytical approximations are shown in Fig. 15 with individual genotyping
serving as a reference. The tail design, with $\rho$ = 27% of the population selected for each
pool, requires a repository only 1.24× larger than required for individual genotyping. It is
also robust to variation in $\rho$ near its optimum, as values from 19% to 37% drop the
efficiency no more than 5%. In contrast, for 10% disease prevalence, the affected/unaffected
design requires a repository 5.3× larger than that required for individual genotyping and is 4×
less efficient than the tail design.
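
These ratios follow from Eqs. 1 to 3 in the small-effect limit, where $\sigma_R \approx 1$ and the bracketed correction in Eq. 1 tends to 1. The sketch below evaluates the limiting ratios; the limit itself is an assumption made here for illustration, and the function names are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def tail_ratio(rho):
    """Limiting N_tail / N_indiv = rho / (2 * y_rho^2)."""
    y = STD.pdf(STD.inv_cdf(rho))
    return rho / (2 * y * y)


def affected_ratio(r):
    """Limiting N_aff/unaff / N_indiv = 2 r (1 - r)^2 / y_r^2."""
    y = STD.pdf(STD.inv_cdf(r))
    return 2 * r * (1 - r) ** 2 / (y * y)


def efficiency_loss(rho, rho_opt=0.2703):
    """Fractional drop in efficiency (1/N) relative to the optimal pooling fraction."""
    return 1 - tail_ratio(rho_opt) / tail_ratio(rho)
```

In this limit `tail_ratio(0.27)` is about 1.24 and `affected_ratio(0.10)` about 5.3, and the efficiency loss at pooling fractions of 19% and 37% stays below 5%.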

The effect of varying the inheritance mode is shown in Figure 16 for tail pools. In this
example, the type I error is $5\times10^{-8}$, the type II error is 0.2, and the displacement $a$
is 0.25 in units of the phenotypic standard deviation. The heterozygote displacement $d$ varies
from $-a$, pure recessive inheritance, to $+a$, pure dominant inheritance. Results are shown for
three frequencies of allele $A_1$: $p$ = 0.5, 0.1, and 0.01. Solid lines correspond to exact
numerical calculations. In the top panel showing the repository size $N$, filled circles
correspond to analytical approximations, Eq. 1, and are virtually indistinguishable from exact
calculations. When $p$ = 0.5, $A_1$ and $A_2$ have equal frequencies, the additive variance is
0.03125, and the dominance variance is 0 regardless of inheritance mode. Since the population
requirements depend primarily on the additive variance, $N$ is independent of the inheritance
mode. For allele frequencies below 0.5, the additive variance increases from left to right and
the population requirement decreases. The maximum population is required when $d$ equals
$a/(2p-1)$, which always falls outside the range depicted. The bottom panel depicts the
corresponding values of $\rho$ from the numerical calculations. The optimal pooling fractions
fall in a narrow range from 24.5% to 27.5%, close to the analytical approximation of 27.03%.
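
The dependence of the additive variance on the heterozygote displacement can be illustrated with the standard quantitative-genetics expression $\sigma_A^2 = 2p(1-p)[a + d(1-2p)]^2$. This formula is not quoted in the text and is assumed here; it is consistent with the value 0.03125 at $p = 0.5$ and with the vanishing additive variance at $d = a/(2p-1)$.

```python
def additive_variance(p, a, d):
    """Additive variance for allele frequency p and displacements a (homozygote), d (heterozygote)."""
    average_effect = a + d * (1 - 2 * p)   # average effect of allele substitution (assumed form)
    return 2 * p * (1 - p) * average_effect ** 2


def sweep_inheritance_mode(p, a, steps=5):
    """Additive variance as d runs from -a (recessive) to +a (dominant)."""
    ds = [-a + 2 * a * i / (steps - 1) for i in range(steps)]
    return [additive_variance(p, a, d) for d in ds]
```

For $p = 0.5$ the sweep is flat at 0.03125, while for $p = 0.1$ it increases monotonically from the recessive to the dominant end of the range.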

The effect of varying the additive variance directly, or equivalently the genotype relative risk
for an allele of known frequency, is shown in Fig. 17. The top panel of Fig. 17 shows that
analytical approximations for $N$ from Eqs. 1 and 2 (solid circles) are nearly indistinguishable
from the exact numerical results (dashed and solid lines) when the genotype relative risk is
below a factor of 2 to 3. Type I and II error rates are $5\times10^{-8}$ and 0.2 respectively,
and the allele frequency is 0.1. The bottom panel shows the corresponding allele frequency
difference that must be measured for a significant finding with a test of pooled DNA. For
example, alleles carrying a 1.5× heterozygote relative risk, corresponding to an additive
variance of 0.01, have a raw frequency difference of 0.04 at significance: the upper pool has an
allele frequency of 0.12 and the lower pool a frequency of 0.08. The population size required to
achieve significance is 4700, with 1270 individuals selected per pool.
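
The frequency difference at significance can be approximated from the null variance $p(1-p)/n$ of the test statistic. The sketch below assumes the normal approximation and, for the pool frequencies, a symmetric split of the difference around the population frequency; both are simplifying assumptions, and the function name is hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def critical_delta_p(p, n, alpha=5e-8):
    """Smallest allele frequency difference significant at level alpha for pools of n."""
    z_a = -STD.inv_cdf(alpha)              # upper-tail deviate for the type I error
    return z_a * (p * (1 - p) / n) ** 0.5  # z_alpha * sigma_0 / sqrt(n)


crit = critical_delta_p(0.1, 1270)
upper_pool = 0.1 + crit / 2   # assumed symmetric split around p
lower_pool = 0.1 - crit / 2
```

With $p = 0.1$ and $n = 1270$ the critical difference is about 0.045, which rounds to the pool frequencies of 0.12 and 0.08 quoted above.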

This analysis assumes that allele frequency measurement error is negligible. Allele
frequencies measured by most technologies, including PCR amplification, kinetic PCR,
denaturing high performance liquid chromatography, single-strand conformation
polymorphism, pyrophosphate sequencing, and mass spectrometry, are typically reported with
standard errors in the range of 0.01 to 0.02. Assuming a measurement error of 0.01, the
measurement error in the frequency difference is larger by a factor of $\sqrt{2}$, yielding a
final error of 0.014. Based on the measurement error, the allele frequency difference of 0.04 in
the example above corresponds to a z-score of 2.86 and a type I error rate of 0.002.
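
The arithmetic behind these figures is sketched below, assuming independent measurement errors in the two pools and a one-sided test. With the unrounded error $0.01\sqrt{2} \approx 0.0141$ the z-score is 2.83; the value 2.86 follows when the error is first rounded to 0.014. The function name is hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def measurement_limited_test(delta_p, se_single):
    """Return (standard error of the difference, z-score, one-sided p-value)."""
    se_diff = 2 ** 0.5 * se_single   # independent errors in two pools add in quadrature
    z = delta_p / se_diff
    return se_diff, z, 1 - STD.cdf(z)


se_diff, z, p_value = measurement_limited_test(0.04, 0.01)
```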

While this error rate is much larger than the error rate of $5\times10^{-8}$ required for a
whole-genome scan, a practical solution is to employ pooled allele frequency measurements as a
pre-screen; candidate associations identified by the pre-screen may then be confirmed by
individual genotyping of the entire population, or possibly just the extreme tails. Setting a
type I error rate for the pre-screen of 0.01 (z-score of 2.33), corresponding to an allele
frequency difference of 0.033, implies a 100× savings over an equivalent study that does not
employ a pre-screen.

This experimental limitation sets a threshold for the effect size that may be identified in a
pooled DNA pre-screen. The relationship between the expected value of $\Delta p$ and the
parameters of the genetic model for a SNP with purely additive inheritance is

$$\Delta p = 2.44 \times \left[ z_\alpha / (z_\alpha - z_{1-\beta}) \right] p(1-p)\,a ,$$

where the initial factor of 2.44 arises from the optimized pooled tail design, $z_\alpha$ and
$z_{1-\beta}$ correspond to the type I and II errors that would be obtained neglecting
measurement error, and $a$ is the phenotypic displacement as before. For use in a pre-screen
with a p-value of 0.01 from measurement error alone, $z_\alpha = 2.33$ is reasonable. To retain
at least 95% of the true associations, $\beta$ should be no greater than 0.05, with
$z_{1-\beta} = -1.64$. These parameters yield $\Delta p$ equal to $1.43 \times p(1-p)a$, or
$p(1-p)a = 0.023$ for the 0.033 frequency difference threshold. For a minor allele frequency of
0.1, this corresponds to a displacement $a$ of 0.26 and an additive variance of 0.012; for allele
frequencies of 0.5, the displacement is 0.092 and the additive variance is 0.0042. Thus, the
pre-screen retains the power to detect markers with additive variance down to 0.5% to 1.5%,
depending on the marker frequency.
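
The pre-screen limits can be reproduced step by step. The sketch below takes the factor 2.44 and the 0.033 threshold from the text and assumes purely additive inheritance, so that the additive variance is $2p(1-p)a^2$; the function names are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def prescreen_coefficient(alpha=0.01, beta=0.05, design_factor=2.44):
    """Coefficient c in delta_p = c * p(1-p) * a for the optimized tail design."""
    z_a = -STD.inv_cdf(alpha)   # about 2.33
    z_b = STD.inv_cdf(beta)     # z_{1-beta}, about -1.64
    return design_factor * z_a / (z_a - z_b)


def detectable_effect(p, threshold=0.033):
    """Smallest displacement a and additive variance detectable at allele frequency p."""
    pqa = threshold / prescreen_coefficient()   # required value of p(1-p)a
    a = pqa / (p * (1 - p))
    return a, 2 * p * (1 - p) * a * a           # additive variance for d = 0
```

Calling `detectable_effect(0.1)` and `detectable_effect(0.5)` gives displacements of about 0.26 and 0.092, with additive variances of about 0.012 and 0.0042.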

Tables

Table I. Sib-pair genotype probabilities

Table II. Pooling Designs

Table III. Sib-pair genotype probabilities

OTHER EMBODIMENTS

While the invention has been described in conjunction with the detailed description 
thereof, the foregoing description is intended to illustrate and not limit the scope of the 
invention, which is defined by the scope of the appended claims. Other aspects, advantages, 
and modifications are within the scope of the following claims.

What is claimed is:

1. A method for detecting an association in a population of individuals between a 
genetic locus and a quantitative phenotype, wherein two or more alleles occur at the locus, and 
wherein the phenotype is expressed using a numerical phenotypic value whose range falls 
within a first numerical limit and a second numerical limit, the method comprising the steps of 

a) obtaining the phenotypic value for each individual in the population; 

b) selecting a first subpopulation of individuals having phenotypic values that are 
higher than a predetermined lower limit and pooling DNA from the individuals in the first 
subpopulation to provide an upper pool; 

c) selecting a second subpopulation of individuals having phenotypic values that are 
lower than a predetermined upper limit and pooling DNA from the individuals in the second 
subpopulation to provide a lower pool; 

d) for one or more genetic loci, measuring the difference in frequency of occurrence of 
a specified allele between the upper pool and the lower pool; and 

e) determining that an association exists if the allele frequency difference between the 
pools is larger than a predetermined value. 

2. The method described in claim 1 wherein the lower limit and the upper limit are
chosen such that, for a specified false-positive rate, the frequency of occurrence of
false-negative errors is minimized.

3. The method described in claim 1 wherein the population comprises unrelated 
individuals. 

4. The method described in claim 1 wherein the population comprises related 
individuals. 

5. The method described in claim 3 wherein the predetermined lower limit is set so that 
the upper pool includes the highest 35% of the population and the predetermined upper limit is 
set so that the lower pool includes the lowest 35% of the population. 

WO 02/16643 PCT/US01/25924

6. The method described in claim 3 wherein the predetermined lower limit is set so that 
the upper pool includes the highest 30% of the population and the predetermined upper limit is 
set so that the lower pool includes the lowest 30% of the population. 

7. The method described in claim 3 wherein the predetermined lower limit is set so that 
the upper pool includes the highest 27% of the population and the predetermined upper limit is
set so that the lower pool includes the lowest 27% of the population. 

8. The method described in claim 2 wherein the individuals in the population are 
sibling pairs and each pair is ranked according to the phenotypic values of the siblings in each 
pair, and either (i) both members of the sibling pair are selected for the upper pool; (ii) both 
members of the sibling pair are selected for the lower pool; or (iii) neither member of the
sibling pair is selected. 

9. The method described in claim 8 wherein each sibling pair is ranked according to a 
mean value of the phenotypic values of the siblings in each pair, and wherein both members of 
the sibling pair are in the same pool. 

10. The method described in claim 8 wherein the phenotypic values of the siblings in 
each pair are both above a predetermined lower limit or both below a predetermined upper 
limit. 

11. The method described in claim 8 wherein the predetermined lower limit is set so 
that the upper pool includes the pairs with the highest 10% of the mean values in the 
population and the predetermined upper limit is set so that the lower pool includes the lowest 
10% of the mean values in the population. 

12. The method described in claim 8 wherein the predetermined lower limit is set so 
that the upper pool includes the pairs with the highest 15% of the mean values in the 
population and the predetermined upper limit is set so that the lower pool includes the lowest
15% of the mean values in the population.

13. The method described in claim 8 wherein the predetermined lower limit is set so
that the upper pool includes the pairs with the highest 20% of the mean values in the 
population and the predetermined upper limit is set so that the lower pool includes the lowest 
20% of the mean values in the population. 

14. The method described in claim 8 wherein the predetermined lower limit is set so 
that the upper pool includes the pairs with the highest 25% of the mean values in the 
population and the predetermined upper limit is set so that the lower pool includes the lowest 
25% of the mean values in the population. 

15. The method described in claim 8 wherein the predetermined lower limit is set so 
that the upper pool includes the pairs with the highest 27% of the mean values in the
population and the predetermined upper limit is set so that the lower pool includes the lowest 
27% of the mean values in the population. 

16. The method described in claim 2 wherein all individuals in the population are 
members of sibling pairs, and either (i) one member of a sibling pair is selected for the upper 
pool and the second member of the sibling pair is selected for the lower pool; or (ii) neither 
member of a sibling pair is selected. 

17. The method described in claim 16 wherein the sibling pairs are ranked by the
absolute magnitude of the difference in phenotypic value for the siblings within each pair, the 
percent of pairs with the greatest difference are identified, and the siblings in each pair are 
distributed such that the sibling with the high phenotypic value is selected for the upper pool 
and the sibling with the low phenotypic value is selected for the lower pool. 

18. The method described in claim 17 wherein the phenotypic value of one member of
the sibling pair is above a predetermined lower limit and the phenotypic value of the second
member of the sibling pair is below a predetermined upper limit.

19. The method described in claim 17 wherein the percent of pairs is 80% and the 
distribution provides 10% of the population in each pool.

20. The method described in claim 17 wherein the percent of pairs is 70% and the
distribution provides 15% of the population in each pool. 

21. The method described in claim 17 wherein the percent of pairs is 60% and the
distribution provides 20% of the population in each pool. 

22. The method described in claim 17 wherein the percent of pairs is 50% and the 
distribution provides 25% of the population in each pool. 

23. The method described in claim 17 wherein the percent of pairs is 54% and the 
distribution provides 27% of the population in each pool. 

24. The method described in claim 2 wherein the individuals in the population are 
sibling pairs and the results obtained by performing the methods described in claims 7 and 15 
are combined. 

25. The method described in claim 3 wherein the population of unrelated individuals
is provided by a process comprising the steps of:

a) providing a population of sibling pairs; and 

b) selecting only one member of a sibling pair to be included in the population of 
unrelated individuals. 

26. The method described in claim 25 further comprising the steps of:

a) calculating the overall mean of the phenotypic values in the population; 

b) subtracting the mean from each phenotypic value; 

c) ranking each sibling pair according to the result of the calculation conducted 
according to 

$(\text{pair-mean})^2/(\text{variance of pair-mean}) + (\text{pair-difference})^2/(\text{variance of pair-difference})$
to provide the Mahalanobis rank; 

d) identifying a more extreme sibling from each sibling pair as the member of the pair
having a greater magnitude of the phenotypic value; and

e) from sibling pairs having extreme Mahalanobis ranks constructing pools using the
sibling of the pair having the more extreme phenotypic value.

27. The method described in claim 25, further comprising the steps of:

a) calculating the overall mean of the phenotypic values in the population; and 

b) selecting that member of each sibling pair having a phenotypic value such that the 
absolute value of the difference between the individual's phenotypic value and the overall
mean is greater than the difference for the other individual in the pair, 

thereby providing a population of unrelated individuals. 

28. The method described in claim 25 further comprising the steps of: 

a) rank ordering the members of the population of sibling pairs to generate a list 
wherein the rank order of each member of a sibling pair is obtained as the smaller of: 

i) the distance from the first member on the list and 

ii) the distance from the last member on the list; and 

b) selecting that member of each sibling pair having a lower ranking; 
thereby providing a population of unrelated individuals. 

29. The method described in claim 25 further comprising the steps of: 

a) rank ordering the members of the population of sibling pairs to generate a list 
wherein the rank order of each member of a sibling pair is obtained as the distance from the 
phenotype mean; and 

b) selecting that member of each sibling pair having a lower ranking; 
thereby providing a population of unrelated individuals. 

30. The method described in claim 1 wherein the population includes individuals who 
may be classified into classes. 

31. The method described in claim 30 wherein the classes are based on an age group,
gender, race or ethnic origin. 

32. The method described in claim 31 wherein all the members of a class are included
in the pools.

33. The method described in claim 1 for determining the genetic basis of disease
predisposition. 

34. The method described in claim 33, wherein the genetic locus which is analyzed for 
determining the genetic basis of disease predisposition contains a single nucleotide 
polymorphism.